[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258230#comment-16258230 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-345473160 I think the hash functions we are using are pretty expensive. We don't need super high quality hash functions for this code, they only need to be reasonable but use limited CPU cycles. We're also going to want to add SSE4.2 accelerated versions (since sse4.2 has instrinsics for crc32 hashes) that we select at runtime if the host processor supports it This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258064#comment-16258064 ] ASF GitHub Bot commented on ARROW-1559: --- xhochy commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151837230 ## File path: cpp/src/arrow/compute/kernels/hash.cc ## @@ -0,0 +1,880 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/hash.h" + +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/util-internal.h" +#include "arrow/util/hash-util.h" + +namespace arrow { +namespace compute { + +namespace { + +// Initially 1024 elements +static constexpr int64_t kInitialHashTableSize = 1 << 10; + +typedef int32_t hash_slot_t; +static constexpr hash_slot_t kHashSlotEmpty = std::numeric_limits::max(); + +// The maximum load factor for the hash table before resizing. +static constexpr double kMaxHashTableLoad = 0.7; + +enum class SIMDMode : char { NOSIMD, SSE4, AVX2 }; + +#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE) \ + if (!KERNEL) { \ +std::stringstream ss; \ +ss << FUNCNAME << " not implemented for " << type->ToString(); \ +return Status::NotImplemented(ss.str()); \ + } + +Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* out) { + auto hash_table = std::make_shared(pool); + + RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size)); + int32_t* slots = reinterpret_cast(hash_table->mutable_data()); + std::fill(slots, slots + size, kHashSlotEmpty); + + *out = hash_table; + return Status::OK(); +} + +// This is a slight design concession -- some hash actions have the possibility +// of failure. Rather than introduce extra error checking into all actions, we +// will raise an internal exception so that only the actions where errors can +// occur will experience the extra overhead +class HashException : public std::exception { + public: + explicit HashException(const std::string& msg, StatusCode code = StatusCode::Invalid) + : msg_(msg), code_(code) {} + + ~HashException() throw() {} + + const char* what() const throw() override; + + StatusCode code() const { return code_; } + + private: + std::string msg_; + StatusCode code_; +}; + +const char* HashException::what() const throw() { return msg_.c_str(); } + +class HashTable { + public: + HashTable(const std::shared_ptr& type, MemoryPool* pool) + : type_(type), +pool_(pool), +initialized_(false), +hash_table_(nullptr), +hash_slots_(nullptr), +hash_table_size_(0), +mod_bitmask_(0) {} + + virtual ~HashTable() {} + + virtual Status Append(const ArrayData& input) = 0; + virtual Status Flush(std::vector* out) = 0; + virtual Status GetDictionary(std::shared_ptr* out) = 0; + + protected: + Status Init(int64_t elements); + + std::shared_ptr type_; + MemoryPool* pool_; + bool initialized_; + + // The hash table contains integer indices that reference the set of observed + // distinct values + std::shared_ptr hash_table_; + hash_slot_t* hash_slots_; + + /// Size of the table. Must be a power of 2. + int64_t hash_table_size_; + + // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j % + // hash_table_size_, but uses far fewer CPU cycles + int64_t mod_bitmask_; +}; + +Status HashTable::Init(int64_t elements) { + DCHECK_EQ(elements, BitUtil::NextPower2(elements)); + RETURN_NOT_OK(NewHashTable(elements, pool_, _table_)); + hash_slots_ = reinterpret_cast (hash_table_->mutable_data()); + hash_table_size_ = elements; + mod_bitmask_ = elements - 1; + initialized_ = true; + return Status::OK(); +} + +template +class HashTableKernel : public HashTable {}; + +// Types of
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257728#comment-16257728 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-345390578 Phew, looks like we are finally in the clear. +1, I will merge when the build passes and then get busy with follow ups This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257138#comment-16257138 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-345279306 Lost access to the Windows machine I had been using for local debugging, will have to set something up today to figure out why this isn't linking This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256010#comment-16256010 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151545209 ## File path: cpp/src/arrow/type_traits.h ## @@ -430,6 +430,17 @@ static inline bool is_binary_like(Type::type type_id) { return false; } +static inline bool is_dictionary(Type::type type_id) { + switch (type_id) { Review comment: nope, fixing This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256008#comment-16256008 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151544074 ## File path: cpp/src/arrow/type.h ## @@ -498,25 +498,34 @@ class ARROW_EXPORT StructType : public NestedType { std::vector GetBufferLayout() const override; }; -class ARROW_EXPORT Decimal128Type : public FixedSizeBinaryType { +class ARROW_EXPORT _DecimalBaseType : public FixedSizeBinaryType { Review comment: Sure, this is only temporary until I can remove usages of `DecimalType` from parquet-cpp This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255589#comment-16255589 ] ASF GitHub Bot commented on ARROW-1559: --- cpcloud commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151464797 ## File path: cpp/src/arrow/compute/kernels/hash.cc ## @@ -0,0 +1,822 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/hash.h" + +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/util-internal.h" +#include "arrow/util/hash-util.h" + +namespace arrow { +namespace compute { + +namespace { + +// Initially 1024 elements +static constexpr int64_t kInitialHashTableSize = 1 << 10; + +typedef int32_t hash_slot_t; +static constexpr hash_slot_t kHashSlotEmpty = std::numeric_limits::max(); + +// The maximum load factor for the hash table before resizing. +static constexpr double kMaxHashTableLoad = 0.7; + +enum class SIMDMode : char { NOSIMD, SSE4, AVX2 }; + +#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE) \ + if (!KERNEL) { \ +std::stringstream ss; \ +ss << FUNCNAME << " not implemented for " << type->ToString(); \ +return Status::NotImplemented(ss.str()); \ + } + +Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* out) { + auto hash_table = std::make_shared(pool); + + RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size)); + int32_t* slots = reinterpret_cast(hash_table->mutable_data()); + std::fill(slots, slots + size, kHashSlotEmpty); + + *out = hash_table; + return Status::OK(); +} + +// This is a slight design concession -- some hash actions have the possibility +// of failure. Rather than introduce extra error checking into all actions, we +// will raise an internal exception so that only the actions where errors can +// occur will experience the extra overhead +class HashException : public std::exception { + public: + explicit HashException(const std::string& msg, StatusCode code = StatusCode::Invalid) + : msg_(msg), code_(code) {} + + ~HashException() throw() {} + + const char* what() const throw() override; + + StatusCode code() const { return code_; } + + private: + std::string msg_; + StatusCode code_; +}; + +const char* HashException::what() const throw() { return msg_.c_str(); } + +class HashTable { + public: + HashTable(const std::shared_ptr& type, MemoryPool* pool) + : type_(type), +pool_(pool), +initialized_(false), +hash_table_(nullptr), +hash_slots_(nullptr), +hash_table_size_(0), +mod_bitmask_(0) {} + + virtual ~HashTable() {} + + virtual Status Append(const ArrayData& input) = 0; + virtual Status Flush(Datum* out) = 0; + virtual Status GetDictionary(std::shared_ptr* out) = 0; + + protected: + Status Init(int64_t elements); + + std::shared_ptr type_; + MemoryPool* pool_; + bool initialized_; + + // The hash table contains integer indices that reference the set of observed + // distinct values + std::shared_ptr hash_table_; + hash_slot_t* hash_slots_; + + /// Size of the table. Must be a power of 2. + int64_t hash_table_size_; + + /// Size at which we decide to resize + int64_t hash_table_load_threshold_; + + // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j % + // hash_table_size_, but uses far fewer CPU cycles + int64_t mod_bitmask_; +}; + +Status HashTable::Init(int64_t elements) { + DCHECK_EQ(elements, BitUtil::NextPower2(elements)); + RETURN_NOT_OK(NewHashTable(elements, pool_, _table_)); + hash_slots_ = reinterpret_cast (hash_table_->mutable_data()); + hash_table_size_ = elements; + hash_table_load_threshold_ = + static_cast(static_cast(elements) *
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255573#comment-16255573 ] ASF GitHub Bot commented on ARROW-1559: --- cpcloud commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151463494 ## File path: cpp/src/arrow/type.h ## @@ -498,25 +498,34 @@ class ARROW_EXPORT StructType : public NestedType { std::vector GetBufferLayout() const override; }; -class ARROW_EXPORT Decimal128Type : public FixedSizeBinaryType { +class ARROW_EXPORT _DecimalBaseType : public FixedSizeBinaryType { Review comment: I don't think underscores are allowed by the C++ standard. How about `DecimalBaseType`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255572#comment-16255572 ] ASF GitHub Bot commented on ARROW-1559: --- cpcloud commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151463749 ## File path: cpp/src/arrow/type_traits.h ## @@ -430,6 +430,17 @@ static inline bool is_binary_like(Type::type type_id) { return false; } +static inline bool is_dictionary(Type::type type_id) { + switch (type_id) { Review comment: Any reason not to simplify this to `return type_id == Type::DICTIONARY`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1623#comment-1623 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344976728 I created https://issues.apache.org/jira/browse/ARROW-1828 to take care of the boolean hash table implementation, since this patch has already gotten quite large This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255546#comment-16255546 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151027570 ## File path: cpp/src/arrow/compute/kernels/cast.h ## @@ -60,7 +63,12 @@ Status Cast(FunctionContext* context, const Array& array, const std::shared_ptr& to_type, const CastOptions& options, std::shared_ptr* out); +ARROW_EXPORT +Status Cast(FunctionContext* context, const Datum& value, +const std::shared_ptr& to_type, const CastOptions& options, +Datum* out); + } // namespace compute } // namespace arrow -#endif // ARROW_COMPUTE_CAST_H +#endif // ARROW_COMPUTE_KERNELS_CAST_H Review comment: I probably need to do a once over of Cast and make sure the ChunkedArray form also works This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255535#comment-16255535 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344973038 This is ready to go modulo getting the MSVC build passing (looks like just compiler warnings remaining). Any more feedback? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255537#comment-16255537 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344973188 Also let me know if you like the `x.dictionary_encode()` API. I can always rename that to something else This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254708#comment-16254708 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344807083 So we're about a factor of 2 or so away from `pandas.unique`: ``` In [1]: import pandas as pd import numpy as np import pyarrow as pa ...: In [2]: arr = pa.array(list(range(1)) * 100) In [3]: v = arr.to_pandas().copy() In [4]: timeit arr.unique() pd.unique(100 loops, best of 3: 10.4 ms per loop In [5]: timeit pd.unique(v) 100 loops, best of 3: 4.82 ms per loop ``` I suspect this is because of our hash function being slow, but we'll need to do some more analysis This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254375#comment-16254375 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151275822 ## File path: cpp/src/arrow/compute/kernels/hash.cc ## @@ -0,0 +1,880 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/hash.h" + +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/util-internal.h" +#include "arrow/util/hash-util.h" + +namespace arrow { +namespace compute { + +namespace { + +// Initially 1024 elements +static constexpr int64_t kInitialHashTableSize = 1 << 10; + +typedef int32_t hash_slot_t; +static constexpr hash_slot_t kHashSlotEmpty = std::numeric_limits::max(); + +// The maximum load factor for the hash table before resizing. +static constexpr double kMaxHashTableLoad = 0.7; + +enum class SIMDMode : char { NOSIMD, SSE4, AVX2 }; + +#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE) \ + if (!KERNEL) { \ +std::stringstream ss; \ +ss << FUNCNAME << " not implemented for " << type->ToString(); \ +return Status::NotImplemented(ss.str()); \ + } + +Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* out) { + auto hash_table = std::make_shared(pool); + + RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size)); + int32_t* slots = reinterpret_cast(hash_table->mutable_data()); + std::fill(slots, slots + size, kHashSlotEmpty); + + *out = hash_table; + return Status::OK(); +} + +// This is a slight design concession -- some hash actions have the possibility +// of failure. Rather than introduce extra error checking into all actions, we +// will raise an internal exception so that only the actions where errors can +// occur will experience the extra overhead +class HashException : public std::exception { + public: + explicit HashException(const std::string& msg, StatusCode code = StatusCode::Invalid) + : msg_(msg), code_(code) {} + + ~HashException() throw() {} + + const char* what() const throw() override; + + StatusCode code() const { return code_; } + + private: + std::string msg_; + StatusCode code_; +}; + +const char* HashException::what() const throw() { return msg_.c_str(); } + +class HashTable { + public: + HashTable(const std::shared_ptr& type, MemoryPool* pool) + : type_(type), +pool_(pool), +initialized_(false), +hash_table_(nullptr), +hash_slots_(nullptr), +hash_table_size_(0), +mod_bitmask_(0) {} + + virtual ~HashTable() {} + + virtual Status Append(const ArrayData& input) = 0; + virtual Status Flush(std::vector* out) = 0; + virtual Status GetDictionary(std::shared_ptr* out) = 0; + + protected: + Status Init(int64_t elements); + + std::shared_ptr type_; + MemoryPool* pool_; + bool initialized_; + + // The hash table contains integer indices that reference the set of observed + // distinct values + std::shared_ptr hash_table_; + hash_slot_t* hash_slots_; + + /// Size of the table. Must be a power of 2. + int64_t hash_table_size_; + + // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j % + // hash_table_size_, but uses far fewer CPU cycles + int64_t mod_bitmask_; +}; + +Status HashTable::Init(int64_t elements) { + DCHECK_EQ(elements, BitUtil::NextPower2(elements)); + RETURN_NOT_OK(NewHashTable(elements, pool_, _table_)); + hash_slots_ = reinterpret_cast (hash_table_->mutable_data()); + hash_table_size_ = elements; + mod_bitmask_ = elements - 1; + initialized_ = true; + return Status::OK(); +} + +template +class HashTableKernel : public HashTable {}; + +// Types of hash
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254353#comment-16254353 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151272186 ## File path: cpp/src/arrow/compute/kernels/hash.cc ## @@ -0,0 +1,880 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/hash.h" + +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/util-internal.h" +#include "arrow/util/hash-util.h" + +namespace arrow { +namespace compute { + +namespace { + +// Initially 1024 elements +static constexpr int64_t kInitialHashTableSize = 1 << 10; + +typedef int32_t hash_slot_t; +static constexpr hash_slot_t kHashSlotEmpty = std::numeric_limits::max(); + +// The maximum load factor for the hash table before resizing. +static constexpr double kMaxHashTableLoad = 0.7; + +enum class SIMDMode : char { NOSIMD, SSE4, AVX2 }; + +#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE) \ + if (!KERNEL) { \ +std::stringstream ss; \ +ss << FUNCNAME << " not implemented for " << type->ToString(); \ +return Status::NotImplemented(ss.str()); \ + } + +Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* out) { + auto hash_table = std::make_shared(pool); + + RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size)); + int32_t* slots = reinterpret_cast(hash_table->mutable_data()); + std::fill(slots, slots + size, kHashSlotEmpty); + + *out = hash_table; + return Status::OK(); +} + +// This is a slight design concession -- some hash actions have the possibility +// of failure. Rather than introduce extra error checking into all actions, we +// will raise an internal exception so that only the actions where errors can +// occur will experience the extra overhead +class HashException : public std::exception { + public: + explicit HashException(const std::string& msg, StatusCode code = StatusCode::Invalid) + : msg_(msg), code_(code) {} + + ~HashException() throw() {} + + const char* what() const throw() override; + + StatusCode code() const { return code_; } + + private: + std::string msg_; + StatusCode code_; +}; + +const char* HashException::what() const throw() { return msg_.c_str(); } + +class HashTable { + public: + HashTable(const std::shared_ptr& type, MemoryPool* pool) + : type_(type), +pool_(pool), +initialized_(false), +hash_table_(nullptr), +hash_slots_(nullptr), +hash_table_size_(0), +mod_bitmask_(0) {} + + virtual ~HashTable() {} + + virtual Status Append(const ArrayData& input) = 0; + virtual Status Flush(std::vector* out) = 0; + virtual Status GetDictionary(std::shared_ptr* out) = 0; + + protected: + Status Init(int64_t elements); + + std::shared_ptr type_; + MemoryPool* pool_; + bool initialized_; + + // The hash table contains integer indices that reference the set of observed + // distinct values + std::shared_ptr hash_table_; + hash_slot_t* hash_slots_; + + /// Size of the table. Must be a power of 2. + int64_t hash_table_size_; + + // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j % + // hash_table_size_, but uses far fewer CPU cycles + int64_t mod_bitmask_; +}; + +Status HashTable::Init(int64_t elements) { + DCHECK_EQ(elements, BitUtil::NextPower2(elements)); + RETURN_NOT_OK(NewHashTable(elements, pool_, _table_)); + hash_slots_ = reinterpret_cast (hash_table_->mutable_data()); + hash_table_size_ = elements; + mod_bitmask_ = elements - 1; + initialized_ = true; + return Status::OK(); +} + +template +class HashTableKernel : public HashTable {}; + +// Types of hash
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254362#comment-16254362 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151273409 ## File path: cpp/src/arrow/compute/kernels/hash.cc ## @@ -0,0 +1,880 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/hash.h" + +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/util-internal.h" +#include "arrow/util/hash-util.h" + +namespace arrow { +namespace compute { + +namespace { + +// Initially 1024 elements +static constexpr int64_t kInitialHashTableSize = 1 << 10; + +typedef int32_t hash_slot_t; +static constexpr hash_slot_t kHashSlotEmpty = std::numeric_limits::max(); + +// The maximum load factor for the hash table before resizing. +static constexpr double kMaxHashTableLoad = 0.7; + +enum class SIMDMode : char { NOSIMD, SSE4, AVX2 }; + +#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE) \ + if (!KERNEL) { \ +std::stringstream ss; \ +ss << FUNCNAME << " not implemented for " << type->ToString(); \ +return Status::NotImplemented(ss.str()); \ + } + +Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* out) { + auto hash_table = std::make_shared(pool); + + RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size)); + int32_t* slots = reinterpret_cast(hash_table->mutable_data()); + std::fill(slots, slots + size, kHashSlotEmpty); + + *out = hash_table; + return Status::OK(); +} + +// This is a slight design concession -- some hash actions have the possibility +// of failure. Rather than introduce extra error checking into all actions, we +// will raise an internal exception so that only the actions where errors can +// occur will experience the extra overhead +class HashException : public std::exception { + public: + explicit HashException(const std::string& msg, StatusCode code = StatusCode::Invalid) + : msg_(msg), code_(code) {} + + ~HashException() throw() {} + + const char* what() const throw() override; + + StatusCode code() const { return code_; } + + private: + std::string msg_; + StatusCode code_; +}; + +const char* HashException::what() const throw() { return msg_.c_str(); } + +class HashTable { + public: + HashTable(const std::shared_ptr& type, MemoryPool* pool) + : type_(type), +pool_(pool), +initialized_(false), +hash_table_(nullptr), +hash_slots_(nullptr), +hash_table_size_(0), +mod_bitmask_(0) {} + + virtual ~HashTable() {} + + virtual Status Append(const ArrayData& input) = 0; + virtual Status Flush(std::vector* out) = 0; + virtual Status GetDictionary(std::shared_ptr* out) = 0; + + protected: + Status Init(int64_t elements); + + std::shared_ptr type_; + MemoryPool* pool_; + bool initialized_; + + // The hash table contains integer indices that reference the set of observed + // distinct values + std::shared_ptr hash_table_; + hash_slot_t* hash_slots_; + + /// Size of the table. Must be a power of 2. + int64_t hash_table_size_; + + // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j % + // hash_table_size_, but uses far fewer CPU cycles + int64_t mod_bitmask_; +}; + +Status HashTable::Init(int64_t elements) { + DCHECK_EQ(elements, BitUtil::NextPower2(elements)); + RETURN_NOT_OK(NewHashTable(elements, pool_, _table_)); + hash_slots_ = reinterpret_cast (hash_table_->mutable_data()); + hash_table_size_ = elements; + mod_bitmask_ = elements - 1; + initialized_ = true; + return Status::OK(); +} + +template +class HashTableKernel : public HashTable {}; + +// Types of hash
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252913#comment-16252913 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344480057 Consider it done. The Python tests are segfaulting due to something it seems I broke in the cast code path, so I'll look into that tomorrow and also address the `BooleanType` hashing This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252909#comment-16252909 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151027415 ## File path: cpp/src/arrow/array.cc ## @@ -92,7 +92,7 @@ static inline std::shared_ptr SliceData(const ArrayData& data, int64_ auto new_data = data.ShallowCopy(); new_data->length = length; new_data->offset = offset; - new_data->null_count = kUnknownNullCount; + new_data->null_count = data.null_count != 0 ? kUnknownNullCount : 0; Review comment: This is a helpful optimization -- if you know your null count is already 0, no need to keep thinking it's unknown This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252911#comment-16252911 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151027685 ## File path: cpp/src/arrow/compute/kernels/hash.cc ## @@ -0,0 +1,890 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/kernels/hash.h" + +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" +#include "arrow/compute/kernels/util-internal.h" +#include "arrow/util/hash-util.h" + +namespace arrow { +namespace compute { + +namespace { + +// Initially 1024 elements +static constexpr int64_t kInitialHashTableSize = 1 << 10; + +typedef int32_t hash_slot_t; +static constexpr hash_slot_t kHashSlotEmpty = std::numeric_limits::max(); + +// The maximum load factor for the hash table before resizing. +static constexpr double kMaxHashTableLoad = 0.7; + +enum class SIMDMode : char { NOSIMD, SSE4, AVX2 }; + +#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE) \ + if (!KERNEL) { \ +std::stringstream ss; \ +ss << FUNCNAME << " not implemented for " << type->ToString(); \ +return Status::NotImplemented(ss.str()); \ + } + +Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* out) { + auto hash_table = std::make_shared(pool); + + RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size)); + int32_t* slots = reinterpret_cast(hash_table->mutable_data()); + std::fill(slots, slots + size, kHashSlotEmpty); + + *out = hash_table; + return Status::OK(); +} + +// This is a slight design concession -- some hash actions have the possibility +// of failure. Rather than introduce extra error checking into all actions, we +// will raise an internal exception so that only the actions where errors can +// occur will experience the extra overhead +class HashException : public std::exception { + public: + explicit HashException(const std::string& msg, StatusCode code = StatusCode::Invalid) + : msg_(msg), code_(code) {} + + ~HashException() throw() {} + + const char* what() const throw() override; + + StatusCode code() const { return code_; } + + private: + std::string msg_; + StatusCode code_; +}; + +const char* HashException::what() const throw() { return msg_.c_str(); } + +// TODO(wesm): we do not need this macro yet + +// #define HASH_THROW_NOT_OK(EXPR) \ +// do { \ +// Status _s = (EXPR); \ +// if (ARROW_PREDICT_FALSE(!_s.ok())) {\ +// throw HashException(_s.message(), _s.code()); \ +// } \ +// } while (false) + +class HashTable { + public: + HashTable(const std::shared_ptr& type, MemoryPool* pool) + : type_(type), +pool_(pool), +initialized_(false), +hash_table_(nullptr), +hash_slots_(nullptr), +hash_table_size_(0), +mod_bitmask_(0) {} + + virtual ~HashTable() {} + + virtual Status Append(const ArrayData& input) = 0; + virtual Status Flush(std::vector* out) = 0; + virtual Status GetDictionary(std::shared_ptr* out) = 0; + + protected: + Status Init(int64_t elements); + + std::shared_ptr type_; + MemoryPool* pool_; + bool initialized_; + + // The hash table contains integer indices that reference the set of observed + // distinct values + std::shared_ptr hash_table_; + hash_slot_t* hash_slots_; + + /// Size of the table. Must be a power of 2. + int64_t hash_table_size_; + + // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j % + // hash_table_size_, but uses far fewer CPU cycles + int64_t mod_bitmask_; +}; + +Status
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252910#comment-16252910 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r151027570 ## File path: cpp/src/arrow/compute/kernels/cast.h ## @@ -60,7 +63,12 @@ Status Cast(FunctionContext* context, const Array& array, const std::shared_ptr& to_type, const CastOptions& options, std::shared_ptr* out); +ARROW_EXPORT +Status Cast(FunctionContext* context, const Datum& value, +const std::shared_ptr& to_type, const CastOptions& options, +Datum* out); + } // namespace compute } // namespace arrow -#endif // ARROW_COMPUTE_CAST_H +#endif // ARROW_COMPUTE_KERNELS_CAST_H Review comment: I probably need to do a once over of Cast and make sure the ChunkedArray form also works This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252904#comment-16252904 ] ASF GitHub Bot commented on ARROW-1559: --- pcmoritz commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344477903 +1 on the mapbox variant This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252896#comment-16252896 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344476480 OK, I'm finally ready for some code review on this. So we have one matter to resolve that is a bit annoying. I created a `arrow::compute::Datum` type that is based on `boost::variant`. Fundamentally, we need such a variant type in our compute kernels to be able to operate on different types of data: scalars, arrays, chunked arrays, etc. As an example, we would like to be able to add two `Datum` values together, each of which might be a scalar value (object model TBD) or an array. I suspect that we do not want to have boost in our public headers. So we have a decision to make here: * We could vendor an ASL 2.0-compatible header-only `boost::variant` replacement like https://github.com/mapbox/variant. We're lucky that one exists * We can refactor `Datum` to use a PIMPL. That seems pretty heavy-weight for such a tiny struct -- in such case even a stack allocation of `Datum` would require a heap allocation, which is not free when you have a lot of these things. I'm inclined to take the former approach, mapbox/variant is ~1200 lines of headers and seems to want to be a drop-in replacement for `boost::variant`, is BSD-3 licensed, so probably not the worst dependency to take on. cc @xhochy @cpcloud This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252895#comment-16252895 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor DictionaryBuilder to be a stateful kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344476480 OK, I'm finally ready for some code review on this. So we have one matter to resolve that is a bit annoying. I created a `arrow::compute::Datum` type that is based on `boost::variant`. Fundamentally, we need such a variant type in our compute kernels to be able to operate on different types of data: scalars, arrays, chunked arrays, etc. I suspect that we do not want to have boost in our public headers. So we have a decision to make here: * We could vendor an ASL 2.0-compatible header-only `boost::variant` replacement like https://github.com/mapbox/variant. We're lucky that one exists * We can refactor `Datum` to use a PIMPL. That seems pretty heavy-weight for such a tiny struct -- in such case even a stack allocation of `Datum` would require a heap allocation, which is not free when you have a lot of these things. I'm inclined to take the former approach, mapbox/variant is ~1200 lines of headers and seems to want to be a drop-in replacement for `boost::variant`, is BSD-3 licensed, so probably not the worst dependency to take on. cc @xhochy @cpcloud This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250843#comment-16250843 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-344145502 I started pulling on threads and the whole thing unraveled. I needed to add a variant output type (`arrow::compute::Datum`) and I wanted to change unique and dictionary-encode to both be kernels. I'm close to having things all working again, I will spend the next day or so re-writing the unit tests to work with the new code structure, and then making sure we don't have any obvious regressions. As one annoying matter, when you dictionary encode a chunked array, you may not have seen all the unique values yet, so the integer type output may change as you observe more chunks. As a result, for the time being I think it is best if we dictionary encode everything to int32 instead of using the adaptive integer builder. If we want to optimize space to make things smaller we can revisit after this patch This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244149#comment-16244149 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-342852623 I'm still working on this. There's quite a bit of work to do to refactor the `DictionaryBuilder` tests to be kernel-based, so with luck I'll have a patch up later today or tomorrow This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236124#comment-16236124 ] ASF GitHub Bot commented on ARROW-1559: --- xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-341489847 @wesm I don't expect you to write the documents. This is just a general remark that we need those. As I'm probably more interested in API stability and moving forward with the casting, I should spend time on it. As my time is also quite severely limited at the moment, I will have a look at writing more documentation. This is not really an easy task but I think necessary for us to get more community members on board. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236111#comment-16236111 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-341486675 I would be happy to write some design documents, but in exchange I need some more help moving along routine development and maintenance of the project. The honest truth is that I simply haven't had enough time to really think through all of the details of what the kernel APIs need to look like -- for example, we need to define an object model to accommodate scalar values (for example, adding a scalar to an array, or casting a scalar from one type to another). The code for the cast kernels could be a lot cleaner than it is. I have a high tolerance for this kind of uncertainty and don't mind doing a lot of refactoring as we figure out the right general shape of the kernel-operator API. I've also been looking at libraries like Dremio and TensorFlow a bit for inspiration. At a high level, the purpose of the kernels is to be able to perform analytics on chunked arrays. Depending on the operator, the kernels may need to be stateful (e.g. reductions, hash-table based analytics) or stateless (elementwise functions, NumPy ufunc-like math, etc.). Operators will need to be able to accommodate dispatch to different variants of kernels (e.g. SIMD-enabled, non-SIMD, GPU) I'd like to spend the majority of my time working solely on kernels and analytics, but I can't do that at the expense of all the other small things that would fall through the cracks otherwise. Out of the 0.8.0 milestone so far, I have resolved 64 JIRAs -- the rest of the Arrow community has done 67. So my effective burden of moving along the project only looking at JIRAs is around 50%. When you add release management and PR maintenance, the number goes above 50% for sure. This is not a complaint, just pointing out with things as they are I am not sure I can do more than I'm already doing. I will be more than happy do more design and architecture work as soon as the community starts sharing more of the development workload. At the moment, to let the development work drop to write more documentation and design docs seems like an unacceptable compromise to me. Getting the project to a format stable 1.0.0 release and to make it suitable for production use for data interchange in Apache Spark and elsewhere is the most important thing for me right now, and engineering work in support of that is going to take priority over design docs I'm on a plane right now so I'm going to hack on these hash kernels for several hours and see how far I can get This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234464#comment-16234464 ] ASF GitHub Bot commented on ARROW-1559: --- xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-341184022 At the moment, I also tend to step a bit back and first have a look at this again in a design document. There are several issues where I have no clear opinion yet but that would probably require some thinking: * Do we need kernel call methods for each level of Array/ChunkedArray/Column? Having them instead of a generic `InvokeUnary` on each of the three data structures might lead to a lot of code duplication or simple pass-through functions. Otherwise having an `InvokeUnary` method would prohibit us from doing some optimizations in the case that we pass over several arrays in a column and could do some operations only once. * My use case here is to selective categorical conversion, my initial approach was to implement `unique(column)` and then use this to create a `DictionaryType` instance that would then be fed to all underlying arrays to make the categorical conversion. This might not be the best solution as the `DictionaryType` instance doesn't contain the hash map anymore and would have to reconstruct it. Also, do we in general have a design document for the kernels? We need to think about state, parallelisation, .. in general. I might have missed this but I think having it integrated into the Arrow documentation will ease entry for future contributors (and myself). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234450#comment-16234450 ] ASF GitHub Bot commented on ARROW-1559: --- xhochy commented on a change in pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r148330119 ## File path: cpp/src/arrow/compute/hash_kernels.cc ## @@ -0,0 +1,131 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/hash_kernels.h" + +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" + +namespace arrow { +namespace compute { + +namespace { + +template +class UniqueKernel : public UnaryKernel { + Status Call(FunctionContext* ctx, const Array& input, + std::shared_ptr* out) override { +if (!unique_builder) { + unique_builder = + std::make_shared(input.type(), ctx->memory_pool()); +} + +RETURN_NOT_OK(unique_builder->AppendArray(input)); + +if (out) { + RETURN_NOT_OK(unique_builder->FinishInternal(out)); + unique_builder.reset(); +} + +return Status::OK(); + } + + private: + std::shared_ptr unique_builder; +}; +} + +#define UNIQUE_FUNCTION_CASE(InType)\ + case InType::type_id: \ +*kernel = std::unique_ptr(new UniqueKernel()); \ +break + +Status GetUniqueFunction(const DataType& in_type, std::unique_ptr* kernel) { + switch (in_type.id()) { +// UNIQUE_FUNCTION_CASE(NullType); +// UNIQUE_FUNCTION_CASE(BooleanType); +UNIQUE_FUNCTION_CASE(UInt8Type); +UNIQUE_FUNCTION_CASE(Int8Type); +UNIQUE_FUNCTION_CASE(UInt16Type); +UNIQUE_FUNCTION_CASE(Int16Type); +UNIQUE_FUNCTION_CASE(UInt32Type); +UNIQUE_FUNCTION_CASE(Int32Type); +UNIQUE_FUNCTION_CASE(UInt64Type); +UNIQUE_FUNCTION_CASE(Int64Type); +UNIQUE_FUNCTION_CASE(FloatType); +UNIQUE_FUNCTION_CASE(DoubleType); +UNIQUE_FUNCTION_CASE(Date32Type); +UNIQUE_FUNCTION_CASE(Date64Type); +UNIQUE_FUNCTION_CASE(Time32Type); +UNIQUE_FUNCTION_CASE(Time64Type); +UNIQUE_FUNCTION_CASE(TimestampType); +UNIQUE_FUNCTION_CASE(BinaryType); +UNIQUE_FUNCTION_CASE(StringType); +UNIQUE_FUNCTION_CASE(FixedSizeBinaryType); +default: + break; + } + + if (*kernel == nullptr) { +std::stringstream ss; +ss << "No unique implemented for " << in_type.ToString(); +return Status::NotImplemented(ss.str()); + } + + return Status::OK(); +} + +Status Unique(FunctionContext* ctx, const Array& array, std::shared_ptr* out) { + // Dynamic dispatch to obtain right cast function + std::unique_ptr func; + RETURN_NOT_OK(GetUniqueFunction(*array.type(), )); + + std::shared_ptr out_data; + RETURN_NOT_OK(func->Call(ctx, array, _data)); + *out = MakeArray(out_data); + return Status::OK(); +} + +Status Unique(FunctionContext* ctx, const ChunkedArray& array, + std::shared_ptr* out) { + // Dynamic dispatch to obtain right cast function + std::unique_ptr func; + RETURN_NOT_OK(GetUniqueFunction(*array.type(), )); + + // Call the kernel without out_data on all but the last chunk + for (int i = 0; i < (array.num_chunks() - 1); i++) { +RETURN_NOT_OK(func->Call(ctx, *array.chunk(i), nullptr)); + } + + std::shared_ptr out_data; + // The array has a large chunk, call the kernel and retrieve the result. + RETURN_NOT_OK(func->Call(ctx, *array.chunk(array.num_chunks() - 1), _data)); + *out = MakeArray(out_data); + + return Status::OK(); +} + +Status Unique(FunctionContext* context, const Column& array, + std::shared_ptr* out) { + return Unique(context, *array.data(), out); +} Review comment: This also sounds fine to me. In the end, we would need to have these three levels for all kernels (Array, ChunkedArray, Column) which would mean a lot of pass-through methods. This is an automated message from the Apache Git Service. To respond to the message, please log on
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234449#comment-16234449 ] ASF GitHub Bot commented on ARROW-1559: --- xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-341180952 Feel free to push here, I'm open for the suggested changes. This is probably going to be a bit bigger as it lays some more foundations for the compute library. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226988#comment-16226988 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-340800603 Do you mind if I take some time to work on this today/tomorrow? I can push to this branch so we can collaborate here This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225663#comment-16225663 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r147820948 ## File path: cpp/src/arrow/builder.cc ## @@ -1110,20 +1034,202 @@ Status DictionaryBuilder::AppendDictionary(const Scalar& value) { } \ \ template <> \ - int DictionaryBuilder::HashValue(const WrappedBinary& value) { \ + WrappedBinary UniqueBuilder::GetDictionaryValue(int64_t index) { \ +int32_t v_len; \ +const uint8_t* v = dict_builder_.GetValue(static_cast(index), _len); \ +return WrappedBinary(v, v_len); \ + } \ + \ + template <> \ + Status UniqueBuilder::AppendDictionary(const WrappedBinary& value) { \ +return dict_builder_.Append(value.ptr_, value.length_); \ + } \ + \ + template <> \ + int UniqueBuilder::HashValue(const WrappedBinary& value) { \ return HashUtil::Hash(value.ptr_, value.length_, 0); \ } \ \ template <> \ - bool DictionaryBuilder::SlotDifferent(hash_slot_t index, \ - const WrappedBinary& value) { \ + bool UniqueBuilder::SlotDifferent(hash_slot_t index, \ + const WrappedBinary& value) { \ int32_t other_length; \ const uint8_t* other_value = \ dict_builder_.GetValue(static_cast(index), _length); \ return !(other_length == value.length_ && \ 0 == memcmp(other_value, value.ptr_, value.length_)); \ } +BINARY_UNIQUE_SPECIALIZATIONS(StringType); +BINARY_UNIQUE_SPECIALIZATIONS(BinaryType); + +template +Status UniqueBuilder::FinishInternal(std::shared_ptr* out) { + return dict_builder_.FinishInternal(out); +} + +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; + +// -- +// DictionaryBuilder + +template +DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type, +MemoryPool* pool) +: ArrayBuilder(type, pool), unique_builder_(type, pool), values_builder_(pool) {} + +DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type, + MemoryPool* pool) +: ArrayBuilder(type, pool), values_builder_(pool) {} + +DictionaryBuilder::~DictionaryBuilder() {} + +template <> +DictionaryBuilder::DictionaryBuilder( +const std::shared_ptr& type, MemoryPool* pool) +: ArrayBuilder(type, pool), unique_builder_(type, pool), values_builder_(pool) {} + +template +Status DictionaryBuilder::Init(int64_t elements) { + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + RETURN_NOT_OK(unique_builder_.Init(elements)); + return values_builder_.Init(elements); +} + +Status DictionaryBuilder::Init(int64_t elements) { + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + return
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225497#comment-16225497 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r147792444 ## File path: cpp/src/arrow/compute/hash_kernels.cc ## @@ -0,0 +1,131 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/hash_kernels.h" + +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" + +namespace arrow { +namespace compute { + +namespace { + +template +class UniqueKernel : public UnaryKernel { + Status Call(FunctionContext* ctx, const Array& input, + std::shared_ptr* out) override { +if (!unique_builder) { + unique_builder = + std::make_shared(input.type(), ctx->memory_pool()); +} + +RETURN_NOT_OK(unique_builder->AppendArray(input)); + +if (out) { + RETURN_NOT_OK(unique_builder->FinishInternal(out)); + unique_builder.reset(); +} + +return Status::OK(); + } + + private: + std::shared_ptr unique_builder; +}; +} + +#define UNIQUE_FUNCTION_CASE(InType)\ + case InType::type_id: \ +*kernel = std::unique_ptr(new UniqueKernel()); \ +break + +Status GetUniqueFunction(const DataType& in_type, std::unique_ptr* kernel) { + switch (in_type.id()) { +// UNIQUE_FUNCTION_CASE(NullType); +// UNIQUE_FUNCTION_CASE(BooleanType); +UNIQUE_FUNCTION_CASE(UInt8Type); +UNIQUE_FUNCTION_CASE(Int8Type); +UNIQUE_FUNCTION_CASE(UInt16Type); +UNIQUE_FUNCTION_CASE(Int16Type); +UNIQUE_FUNCTION_CASE(UInt32Type); +UNIQUE_FUNCTION_CASE(Int32Type); +UNIQUE_FUNCTION_CASE(UInt64Type); +UNIQUE_FUNCTION_CASE(Int64Type); +UNIQUE_FUNCTION_CASE(FloatType); +UNIQUE_FUNCTION_CASE(DoubleType); +UNIQUE_FUNCTION_CASE(Date32Type); +UNIQUE_FUNCTION_CASE(Date64Type); +UNIQUE_FUNCTION_CASE(Time32Type); +UNIQUE_FUNCTION_CASE(Time64Type); +UNIQUE_FUNCTION_CASE(TimestampType); +UNIQUE_FUNCTION_CASE(BinaryType); +UNIQUE_FUNCTION_CASE(StringType); +UNIQUE_FUNCTION_CASE(FixedSizeBinaryType); +default: + break; + } + + if (*kernel == nullptr) { +std::stringstream ss; +ss << "No unique implemented for " << in_type.ToString(); +return Status::NotImplemented(ss.str()); + } + + return Status::OK(); +} + +Status Unique(FunctionContext* ctx, const Array& array, std::shared_ptr* out) { + // Dynamic dispatch to obtain right cast function + std::unique_ptr func; + RETURN_NOT_OK(GetUniqueFunction(*array.type(), )); + + std::shared_ptr out_data; + RETURN_NOT_OK(func->Call(ctx, array, _data)); + *out = MakeArray(out_data); + return Status::OK(); +} + +Status Unique(FunctionContext* ctx, const ChunkedArray& array, + std::shared_ptr* out) { + // Dynamic dispatch to obtain right cast function + std::unique_ptr func; + RETURN_NOT_OK(GetUniqueFunction(*array.type(), )); + + // Call the kernel without out_data on all but the last chunk + for (int i = 0; i < (array.num_chunks() - 1); i++) { +RETURN_NOT_OK(func->Call(ctx, *array.chunk(i), nullptr)); + } + + std::shared_ptr out_data; + // The array has a large chunk, call the kernel and retrieve the result. + RETURN_NOT_OK(func->Call(ctx, *array.chunk(array.num_chunks() - 1), _data)); + *out = MakeArray(out_data); + + return Status::OK(); +} + +Status Unique(FunctionContext* context, const Column& array, + std::shared_ptr* out) { + return Unique(context, *array.data(), out); +} Review comment: I'm not sure about having `Unique` wrappers for ChunkedArray and Column. What do we think about implementing generic `InvokeUnary` methods that accept a generic `UnaryKernel` that applies to these data structures? This is an automated message from the Apache Git Service. To
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225492#comment-16225492 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r147787872 ## File path: cpp/src/arrow/CMakeLists.txt ## @@ -52,6 +52,7 @@ if (ARROW_COMPUTE) set(ARROW_SRCS ${ARROW_SRCS} compute/cast.cc compute/context.cc +compute/hash_kernels.cc Review comment: In a subsequent PR we should create the `src/arrow/compute/kernels` directory for cleaner organization of code This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225491#comment-16225491 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r147791018 ## File path: cpp/src/arrow/builder.cc ## @@ -1110,20 +1034,202 @@ Status DictionaryBuilder::AppendDictionary(const Scalar& value) { } \ \ template <> \ - int DictionaryBuilder::HashValue(const WrappedBinary& value) { \ + WrappedBinary UniqueBuilder::GetDictionaryValue(int64_t index) { \ +int32_t v_len; \ +const uint8_t* v = dict_builder_.GetValue(static_cast(index), _len); \ +return WrappedBinary(v, v_len); \ + } \ + \ + template <> \ + Status UniqueBuilder::AppendDictionary(const WrappedBinary& value) { \ +return dict_builder_.Append(value.ptr_, value.length_); \ + } \ + \ + template <> \ + int UniqueBuilder::HashValue(const WrappedBinary& value) { \ return HashUtil::Hash(value.ptr_, value.length_, 0); \ } \ \ template <> \ - bool DictionaryBuilder::SlotDifferent(hash_slot_t index, \ - const WrappedBinary& value) { \ + bool UniqueBuilder::SlotDifferent(hash_slot_t index, \ + const WrappedBinary& value) { \ int32_t other_length; \ const uint8_t* other_value = \ dict_builder_.GetValue(static_cast(index), _length); \ return !(other_length == value.length_ && \ 0 == memcmp(other_value, value.ptr_, value.length_)); \ } +BINARY_UNIQUE_SPECIALIZATIONS(StringType); +BINARY_UNIQUE_SPECIALIZATIONS(BinaryType); + +template +Status UniqueBuilder::FinishInternal(std::shared_ptr* out) { + return dict_builder_.FinishInternal(out); +} + +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; +template class UniqueBuilder; + +// -- +// DictionaryBuilder + +template +DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type, +MemoryPool* pool) +: ArrayBuilder(type, pool), unique_builder_(type, pool), values_builder_(pool) {} + +DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type, + MemoryPool* pool) +: ArrayBuilder(type, pool), values_builder_(pool) {} + +DictionaryBuilder::~DictionaryBuilder() {} + +template <> +DictionaryBuilder::DictionaryBuilder( +const std::shared_ptr& type, MemoryPool* pool) +: ArrayBuilder(type, pool), unique_builder_(type, pool), values_builder_(pool) {} + +template +Status DictionaryBuilder::Init(int64_t elements) { + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + RETURN_NOT_OK(unique_builder_.Init(elements)); + return values_builder_.Init(elements); +} + +Status DictionaryBuilder::Init(int64_t elements) { + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + return
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225493#comment-16225493 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r147788617 ## File path: cpp/src/arrow/builder.cc ## @@ -815,48 +815,35 @@ Status BooleanBuilder::Append(const std::vector& values) { } // -- -// DictionaryBuilder +// UniqueBuilder template -DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type, -MemoryPool* pool) +UniqueBuilder::UniqueBuilder(const std::shared_ptr& type, MemoryPool* pool) : ArrayBuilder(type, pool), Review comment: We may reach a point where it doesn't make sense to inherit from ArrayBuilder, and rather compose it (or not use it at all) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225494#comment-16225494 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#discussion_r147792444 ## File path: cpp/src/arrow/compute/hash_kernels.cc ## @@ -0,0 +1,131 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/compute/hash_kernels.h" + +#include + +#include "arrow/builder.h" +#include "arrow/compute/context.h" +#include "arrow/compute/kernel.h" + +namespace arrow { +namespace compute { + +namespace { + +template +class UniqueKernel : public UnaryKernel { + Status Call(FunctionContext* ctx, const Array& input, + std::shared_ptr* out) override { +if (!unique_builder) { + unique_builder = + std::make_shared(input.type(), ctx->memory_pool()); +} + +RETURN_NOT_OK(unique_builder->AppendArray(input)); + +if (out) { + RETURN_NOT_OK(unique_builder->FinishInternal(out)); + unique_builder.reset(); +} + +return Status::OK(); + } + + private: + std::shared_ptr unique_builder; +}; +} + +#define UNIQUE_FUNCTION_CASE(InType)\ + case InType::type_id: \ +*kernel = std::unique_ptr(new UniqueKernel()); \ +break + +Status GetUniqueFunction(const DataType& in_type, std::unique_ptr* kernel) { + switch (in_type.id()) { +// UNIQUE_FUNCTION_CASE(NullType); +// UNIQUE_FUNCTION_CASE(BooleanType); +UNIQUE_FUNCTION_CASE(UInt8Type); +UNIQUE_FUNCTION_CASE(Int8Type); +UNIQUE_FUNCTION_CASE(UInt16Type); +UNIQUE_FUNCTION_CASE(Int16Type); +UNIQUE_FUNCTION_CASE(UInt32Type); +UNIQUE_FUNCTION_CASE(Int32Type); +UNIQUE_FUNCTION_CASE(UInt64Type); +UNIQUE_FUNCTION_CASE(Int64Type); +UNIQUE_FUNCTION_CASE(FloatType); +UNIQUE_FUNCTION_CASE(DoubleType); +UNIQUE_FUNCTION_CASE(Date32Type); +UNIQUE_FUNCTION_CASE(Date64Type); +UNIQUE_FUNCTION_CASE(Time32Type); +UNIQUE_FUNCTION_CASE(Time64Type); +UNIQUE_FUNCTION_CASE(TimestampType); +UNIQUE_FUNCTION_CASE(BinaryType); +UNIQUE_FUNCTION_CASE(StringType); +UNIQUE_FUNCTION_CASE(FixedSizeBinaryType); +default: + break; + } + + if (*kernel == nullptr) { +std::stringstream ss; +ss << "No unique implemented for " << in_type.ToString(); +return Status::NotImplemented(ss.str()); + } + + return Status::OK(); +} + +Status Unique(FunctionContext* ctx, const Array& array, std::shared_ptr* out) { + // Dynamic dispatch to obtain right cast function + std::unique_ptr func; + RETURN_NOT_OK(GetUniqueFunction(*array.type(), )); + + std::shared_ptr out_data; + RETURN_NOT_OK(func->Call(ctx, array, _data)); + *out = MakeArray(out_data); + return Status::OK(); +} + +Status Unique(FunctionContext* ctx, const ChunkedArray& array, + std::shared_ptr* out) { + // Dynamic dispatch to obtain right cast function + std::unique_ptr func; + RETURN_NOT_OK(GetUniqueFunction(*array.type(), )); + + // Call the kernel without out_data on all but the last chunk + for (int i = 0; i < (array.num_chunks() - 1); i++) { +RETURN_NOT_OK(func->Call(ctx, *array.chunk(i), nullptr)); + } + + std::shared_ptr out_data; + // The array has a large chunk, call the kernel and retrieve the result. + RETURN_NOT_OK(func->Call(ctx, *array.chunk(array.num_chunks() - 1), _data)); + *out = MakeArray(out_data); + + return Status::OK(); +} + +Status Unique(FunctionContext* context, const Column& array, + std::shared_ptr* out) { + return Unique(context, *array.data(), out); +} Review comment: I'm not sure about having `Unique` wrappers for ChunkedArray and Column. What do we think about implementing generic `UnaryInvoke` methods that accept a generic `UnaryKernel` that applies to these data structures? This is an automated message from the Apache Git Service. To
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224260#comment-16224260 ] ASF GitHub Bot commented on ARROW-1559: --- wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-340315454 Let me spend some time looking at this and I'll let you know my thoughts, worst case by EOD tomorrow (Monday) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223776#comment-16223776 ] ASF GitHub Bot commented on ARROW-1559: --- xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266#issuecomment-340226336 @wesm I'm struggling a bit here on how to continue. The next step would be to implement `X->Dictionary` casts for columns. This can get a bit tricky as we first need to fill a `UniqueBuilder` and then go over all chunks and build the resulting arrays. This then sadly means that we have a single stateful cast kernel instance. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223771#comment-16223771 ] ASF GitHub Bot commented on ARROW-1559: --- xhochy opened a new pull request #1266: WIP: ARROW-1559: Add unique kernel URL: https://github.com/apache/arrow/pull/1266 Only intended to implement selective categorical conversion in `to_pandas()` but it seems that there is a lot missing to do this in a clean fashion. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn > Labels: Analytics, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)
[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220896#comment-16220896 ] Wes McKinney commented on ARROW-1559: - All hash table related functions will need an option to address how nulls are handled. For example, null could be included or excluded from the unique values depending on the needs of the application (sometimes null is data, other times it is to be ignored). > [C++] Kernel implementations for "unique" (compute distinct elements of array) > -- > > Key: ARROW-1559 > URL: https://issues.apache.org/jira/browse/ARROW-1559 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney > Labels: Analytics > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)