[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258230#comment-16258230
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-345473160
 
 
   I think the hash functions we are using are pretty expensive. We don't need 
super high quality hash functions for this code, they only need to be 
reasonable but use limited CPU cycles. We're also going to want to add SSE4.2 
accelerated versions (since sse4.2 has instrinsics for crc32 hashes) that we 
select at runtime if the host processor supports it


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258064#comment-16258064
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

xhochy commented on a change in pull request #1266: ARROW-1559: [C++] Add 
Unique kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151837230
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/hash.cc
 ##
 @@ -0,0 +1,880 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/hash.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/util-internal.h"
+#include "arrow/util/hash-util.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+// Initially 1024 elements
+static constexpr int64_t kInitialHashTableSize = 1 << 10;
+
+typedef int32_t hash_slot_t;
+static constexpr hash_slot_t kHashSlotEmpty = 
std::numeric_limits::max();
+
+// The maximum load factor for the hash table before resizing.
+static constexpr double kMaxHashTableLoad = 0.7;
+
+enum class SIMDMode : char { NOSIMD, SSE4, AVX2 };
+
+#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE)  \
+  if (!KERNEL) {   \
+std::stringstream ss;  \
+ss << FUNCNAME << " not implemented for " << type->ToString(); \
+return Status::NotImplemented(ss.str());   \
+  }
+
+Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* 
out) {
+  auto hash_table = std::make_shared(pool);
+
+  RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size));
+  int32_t* slots = reinterpret_cast(hash_table->mutable_data());
+  std::fill(slots, slots + size, kHashSlotEmpty);
+
+  *out = hash_table;
+  return Status::OK();
+}
+
+// This is a slight design concession -- some hash actions have the possibility
+// of failure. Rather than introduce extra error checking into all actions, we
+// will raise an internal exception so that only the actions where errors can
+// occur will experience the extra overhead
+class HashException : public std::exception {
+ public:
+  explicit HashException(const std::string& msg, StatusCode code = 
StatusCode::Invalid)
+  : msg_(msg), code_(code) {}
+
+  ~HashException() throw() {}
+
+  const char* what() const throw() override;
+
+  StatusCode code() const { return code_; }
+
+ private:
+  std::string msg_;
+  StatusCode code_;
+};
+
+const char* HashException::what() const throw() { return msg_.c_str(); }
+
+class HashTable {
+ public:
+  HashTable(const std::shared_ptr& type, MemoryPool* pool)
+  : type_(type),
+pool_(pool),
+initialized_(false),
+hash_table_(nullptr),
+hash_slots_(nullptr),
+hash_table_size_(0),
+mod_bitmask_(0) {}
+
+  virtual ~HashTable() {}
+
+  virtual Status Append(const ArrayData& input) = 0;
+  virtual Status Flush(std::vector* out) = 0;
+  virtual Status GetDictionary(std::shared_ptr* out) = 0;
+
+ protected:
+  Status Init(int64_t elements);
+
+  std::shared_ptr type_;
+  MemoryPool* pool_;
+  bool initialized_;
+
+  // The hash table contains integer indices that reference the set of observed
+  // distinct values
+  std::shared_ptr hash_table_;
+  hash_slot_t* hash_slots_;
+
+  /// Size of the table. Must be a power of 2.
+  int64_t hash_table_size_;
+
+  // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j %
+  // hash_table_size_, but uses far fewer CPU cycles
+  int64_t mod_bitmask_;
+};
+
+Status HashTable::Init(int64_t elements) {
+  DCHECK_EQ(elements, BitUtil::NextPower2(elements));
+  RETURN_NOT_OK(NewHashTable(elements, pool_, _table_));
+  hash_slots_ = reinterpret_cast(hash_table_->mutable_data());
+  hash_table_size_ = elements;
+  mod_bitmask_ = elements - 1;
+  initialized_ = true;
+  return Status::OK();
+}
+
+template 
+class HashTableKernel : public HashTable {};
+
+// Types of 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257728#comment-16257728
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-345390578
 
 
   Phew, looks like we are finally in the clear. +1, I will merge when the 
build passes and then get busy with follow ups


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257138#comment-16257138
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-345279306
 
 
   Lost access to the Windows machine I had been using for local debugging, 
will have to set something up today to figure out why this isn't linking


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256010#comment-16256010
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151545209
 
 

 ##
 File path: cpp/src/arrow/type_traits.h
 ##
 @@ -430,6 +430,17 @@ static inline bool is_binary_like(Type::type type_id) {
   return false;
 }
 
+static inline bool is_dictionary(Type::type type_id) {
+  switch (type_id) {
 
 Review comment:
   nope, fixing


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256008#comment-16256008
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151544074
 
 

 ##
 File path: cpp/src/arrow/type.h
 ##
 @@ -498,25 +498,34 @@ class ARROW_EXPORT StructType : public NestedType {
   std::vector GetBufferLayout() const override;
 };
 
-class ARROW_EXPORT Decimal128Type : public FixedSizeBinaryType {
+class ARROW_EXPORT _DecimalBaseType : public FixedSizeBinaryType {
 
 Review comment:
   Sure, this is only temporary until I can remove usages of `DecimalType` from 
parquet-cpp


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255589#comment-16255589
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

cpcloud commented on a change in pull request #1266: ARROW-1559: [C++] Add 
Unique kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151464797
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/hash.cc
 ##
 @@ -0,0 +1,822 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/hash.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/util-internal.h"
+#include "arrow/util/hash-util.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+// Initially 1024 elements
+static constexpr int64_t kInitialHashTableSize = 1 << 10;
+
+typedef int32_t hash_slot_t;
+static constexpr hash_slot_t kHashSlotEmpty = 
std::numeric_limits::max();
+
+// The maximum load factor for the hash table before resizing.
+static constexpr double kMaxHashTableLoad = 0.7;
+
+enum class SIMDMode : char { NOSIMD, SSE4, AVX2 };
+
+#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE)  \
+  if (!KERNEL) {   \
+std::stringstream ss;  \
+ss << FUNCNAME << " not implemented for " << type->ToString(); \
+return Status::NotImplemented(ss.str());   \
+  }
+
+Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* 
out) {
+  auto hash_table = std::make_shared(pool);
+
+  RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size));
+  int32_t* slots = reinterpret_cast(hash_table->mutable_data());
+  std::fill(slots, slots + size, kHashSlotEmpty);
+
+  *out = hash_table;
+  return Status::OK();
+}
+
+// This is a slight design concession -- some hash actions have the possibility
+// of failure. Rather than introduce extra error checking into all actions, we
+// will raise an internal exception so that only the actions where errors can
+// occur will experience the extra overhead
+class HashException : public std::exception {
+ public:
+  explicit HashException(const std::string& msg, StatusCode code = 
StatusCode::Invalid)
+  : msg_(msg), code_(code) {}
+
+  ~HashException() throw() {}
+
+  const char* what() const throw() override;
+
+  StatusCode code() const { return code_; }
+
+ private:
+  std::string msg_;
+  StatusCode code_;
+};
+
+const char* HashException::what() const throw() { return msg_.c_str(); }
+
+class HashTable {
+ public:
+  HashTable(const std::shared_ptr& type, MemoryPool* pool)
+  : type_(type),
+pool_(pool),
+initialized_(false),
+hash_table_(nullptr),
+hash_slots_(nullptr),
+hash_table_size_(0),
+mod_bitmask_(0) {}
+
+  virtual ~HashTable() {}
+
+  virtual Status Append(const ArrayData& input) = 0;
+  virtual Status Flush(Datum* out) = 0;
+  virtual Status GetDictionary(std::shared_ptr* out) = 0;
+
+ protected:
+  Status Init(int64_t elements);
+
+  std::shared_ptr type_;
+  MemoryPool* pool_;
+  bool initialized_;
+
+  // The hash table contains integer indices that reference the set of observed
+  // distinct values
+  std::shared_ptr hash_table_;
+  hash_slot_t* hash_slots_;
+
+  /// Size of the table. Must be a power of 2.
+  int64_t hash_table_size_;
+
+  /// Size at which we decide to resize
+  int64_t hash_table_load_threshold_;
+
+  // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j %
+  // hash_table_size_, but uses far fewer CPU cycles
+  int64_t mod_bitmask_;
+};
+
+Status HashTable::Init(int64_t elements) {
+  DCHECK_EQ(elements, BitUtil::NextPower2(elements));
+  RETURN_NOT_OK(NewHashTable(elements, pool_, _table_));
+  hash_slots_ = reinterpret_cast(hash_table_->mutable_data());
+  hash_table_size_ = elements;
+  hash_table_load_threshold_ =
+  static_cast(static_cast(elements) * 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255573#comment-16255573
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

cpcloud commented on a change in pull request #1266: ARROW-1559: [C++] Add 
Unique kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151463494
 
 

 ##
 File path: cpp/src/arrow/type.h
 ##
 @@ -498,25 +498,34 @@ class ARROW_EXPORT StructType : public NestedType {
   std::vector GetBufferLayout() const override;
 };
 
-class ARROW_EXPORT Decimal128Type : public FixedSizeBinaryType {
+class ARROW_EXPORT _DecimalBaseType : public FixedSizeBinaryType {
 
 Review comment:
   I don't think underscores are allowed by the C++ standard. How about 
`DecimalBaseType`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255572#comment-16255572
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

cpcloud commented on a change in pull request #1266: ARROW-1559: [C++] Add 
Unique kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151463749
 
 

 ##
 File path: cpp/src/arrow/type_traits.h
 ##
 @@ -430,6 +430,17 @@ static inline bool is_binary_like(Type::type type_id) {
   return false;
 }
 
+static inline bool is_dictionary(Type::type type_id) {
+  switch (type_id) {
 
 Review comment:
   Any reason not to simplify this to `return type_id == Type::DICTIONARY`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1623#comment-1623
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344976728
 
 
   I created https://issues.apache.org/jira/browse/ARROW-1828 to take care of 
the boolean hash table implementation, since this patch has already gotten 
quite large


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255546#comment-16255546
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151027570
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/cast.h
 ##
 @@ -60,7 +63,12 @@ Status Cast(FunctionContext* context, const Array& array,
 const std::shared_ptr& to_type, const CastOptions& 
options,
 std::shared_ptr* out);
 
+ARROW_EXPORT
+Status Cast(FunctionContext* context, const Datum& value,
+const std::shared_ptr& to_type, const CastOptions& 
options,
+Datum* out);
+
 }  // namespace compute
 }  // namespace arrow
 
-#endif  // ARROW_COMPUTE_CAST_H
+#endif  // ARROW_COMPUTE_KERNELS_CAST_H
 
 Review comment:
   I probably need to do a once over of Cast and make sure the ChunkedArray 
form also works


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255535#comment-16255535
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344973038
 
 
   This is ready to go modulo getting the MSVC build passing (looks like just 
compiler warnings remaining). Any more feedback?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255537#comment-16255537
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344973188
 
 
   Also let me know if you like the `x.dictionary_encode()` API. I can always 
rename that to something else


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254708#comment-16254708
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344807083
 
 
   So we're about a factor of 2 or so away from `pandas.unique`:
   
   ```
   In [1]: import pandas as pd
   import numpy as np
   import pyarrow as pa
  ...: 
   
   In [2]: arr = pa.array(list(range(1)) * 100)
   
   In [3]: v = arr.to_pandas().copy()
   
   In [4]: timeit arr.unique()
   pd.unique(100 loops, best of 3: 10.4 ms per loop
   
   In [5]: timeit pd.unique(v)
   100 loops, best of 3: 4.82 ms per loop
   ```
   
   I suspect this is because of our hash function being slow, but we'll need to 
do some more analysis


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254375#comment-16254375
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151275822
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/hash.cc
 ##
 @@ -0,0 +1,880 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/hash.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/util-internal.h"
+#include "arrow/util/hash-util.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+// Initially 1024 elements
+static constexpr int64_t kInitialHashTableSize = 1 << 10;
+
+typedef int32_t hash_slot_t;
+static constexpr hash_slot_t kHashSlotEmpty = 
std::numeric_limits::max();
+
+// The maximum load factor for the hash table before resizing.
+static constexpr double kMaxHashTableLoad = 0.7;
+
+enum class SIMDMode : char { NOSIMD, SSE4, AVX2 };
+
+#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE)  \
+  if (!KERNEL) {   \
+std::stringstream ss;  \
+ss << FUNCNAME << " not implemented for " << type->ToString(); \
+return Status::NotImplemented(ss.str());   \
+  }
+
+Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* 
out) {
+  auto hash_table = std::make_shared(pool);
+
+  RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size));
+  int32_t* slots = reinterpret_cast(hash_table->mutable_data());
+  std::fill(slots, slots + size, kHashSlotEmpty);
+
+  *out = hash_table;
+  return Status::OK();
+}
+
+// This is a slight design concession -- some hash actions have the possibility
+// of failure. Rather than introduce extra error checking into all actions, we
+// will raise an internal exception so that only the actions where errors can
+// occur will experience the extra overhead
+class HashException : public std::exception {
+ public:
+  explicit HashException(const std::string& msg, StatusCode code = 
StatusCode::Invalid)
+  : msg_(msg), code_(code) {}
+
+  ~HashException() throw() {}
+
+  const char* what() const throw() override;
+
+  StatusCode code() const { return code_; }
+
+ private:
+  std::string msg_;
+  StatusCode code_;
+};
+
+const char* HashException::what() const throw() { return msg_.c_str(); }
+
+class HashTable {
+ public:
+  HashTable(const std::shared_ptr& type, MemoryPool* pool)
+  : type_(type),
+pool_(pool),
+initialized_(false),
+hash_table_(nullptr),
+hash_slots_(nullptr),
+hash_table_size_(0),
+mod_bitmask_(0) {}
+
+  virtual ~HashTable() {}
+
+  virtual Status Append(const ArrayData& input) = 0;
+  virtual Status Flush(std::vector* out) = 0;
+  virtual Status GetDictionary(std::shared_ptr* out) = 0;
+
+ protected:
+  Status Init(int64_t elements);
+
+  std::shared_ptr type_;
+  MemoryPool* pool_;
+  bool initialized_;
+
+  // The hash table contains integer indices that reference the set of observed
+  // distinct values
+  std::shared_ptr hash_table_;
+  hash_slot_t* hash_slots_;
+
+  /// Size of the table. Must be a power of 2.
+  int64_t hash_table_size_;
+
+  // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j %
+  // hash_table_size_, but uses far fewer CPU cycles
+  int64_t mod_bitmask_;
+};
+
+Status HashTable::Init(int64_t elements) {
+  DCHECK_EQ(elements, BitUtil::NextPower2(elements));
+  RETURN_NOT_OK(NewHashTable(elements, pool_, _table_));
+  hash_slots_ = reinterpret_cast(hash_table_->mutable_data());
+  hash_table_size_ = elements;
+  mod_bitmask_ = elements - 1;
+  initialized_ = true;
+  return Status::OK();
+}
+
+template 
+class HashTableKernel : public HashTable {};
+
+// Types of hash 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254353#comment-16254353
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151272186
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/hash.cc
 ##
 @@ -0,0 +1,880 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/hash.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/util-internal.h"
+#include "arrow/util/hash-util.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+// Initially 1024 elements
+static constexpr int64_t kInitialHashTableSize = 1 << 10;
+
+typedef int32_t hash_slot_t;
+static constexpr hash_slot_t kHashSlotEmpty = 
std::numeric_limits::max();
+
+// The maximum load factor for the hash table before resizing.
+static constexpr double kMaxHashTableLoad = 0.7;
+
+enum class SIMDMode : char { NOSIMD, SSE4, AVX2 };
+
+#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE)  \
+  if (!KERNEL) {   \
+std::stringstream ss;  \
+ss << FUNCNAME << " not implemented for " << type->ToString(); \
+return Status::NotImplemented(ss.str());   \
+  }
+
+Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* 
out) {
+  auto hash_table = std::make_shared(pool);
+
+  RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size));
+  int32_t* slots = reinterpret_cast(hash_table->mutable_data());
+  std::fill(slots, slots + size, kHashSlotEmpty);
+
+  *out = hash_table;
+  return Status::OK();
+}
+
+// This is a slight design concession -- some hash actions have the possibility
+// of failure. Rather than introduce extra error checking into all actions, we
+// will raise an internal exception so that only the actions where errors can
+// occur will experience the extra overhead
+class HashException : public std::exception {
+ public:
+  explicit HashException(const std::string& msg, StatusCode code = 
StatusCode::Invalid)
+  : msg_(msg), code_(code) {}
+
+  ~HashException() throw() {}
+
+  const char* what() const throw() override;
+
+  StatusCode code() const { return code_; }
+
+ private:
+  std::string msg_;
+  StatusCode code_;
+};
+
+const char* HashException::what() const throw() { return msg_.c_str(); }
+
+class HashTable {
+ public:
+  HashTable(const std::shared_ptr& type, MemoryPool* pool)
+  : type_(type),
+pool_(pool),
+initialized_(false),
+hash_table_(nullptr),
+hash_slots_(nullptr),
+hash_table_size_(0),
+mod_bitmask_(0) {}
+
+  virtual ~HashTable() {}
+
+  virtual Status Append(const ArrayData& input) = 0;
+  virtual Status Flush(std::vector* out) = 0;
+  virtual Status GetDictionary(std::shared_ptr* out) = 0;
+
+ protected:
+  Status Init(int64_t elements);
+
+  std::shared_ptr type_;
+  MemoryPool* pool_;
+  bool initialized_;
+
+  // The hash table contains integer indices that reference the set of observed
+  // distinct values
+  std::shared_ptr hash_table_;
+  hash_slot_t* hash_slots_;
+
+  /// Size of the table. Must be a power of 2.
+  int64_t hash_table_size_;
+
+  // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j %
+  // hash_table_size_, but uses far fewer CPU cycles
+  int64_t mod_bitmask_;
+};
+
+Status HashTable::Init(int64_t elements) {
+  DCHECK_EQ(elements, BitUtil::NextPower2(elements));
+  RETURN_NOT_OK(NewHashTable(elements, pool_, _table_));
+  hash_slots_ = reinterpret_cast(hash_table_->mutable_data());
+  hash_table_size_ = elements;
+  mod_bitmask_ = elements - 1;
+  initialized_ = true;
+  return Status::OK();
+}
+
+template 
+class HashTableKernel : public HashTable {};
+
+// Types of hash 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254362#comment-16254362
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151273409
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/hash.cc
 ##
 @@ -0,0 +1,880 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/hash.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/util-internal.h"
+#include "arrow/util/hash-util.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+// Initially 1024 elements
+static constexpr int64_t kInitialHashTableSize = 1 << 10;
+
+typedef int32_t hash_slot_t;
+static constexpr hash_slot_t kHashSlotEmpty = 
std::numeric_limits::max();
+
+// The maximum load factor for the hash table before resizing.
+static constexpr double kMaxHashTableLoad = 0.7;
+
+enum class SIMDMode : char { NOSIMD, SSE4, AVX2 };
+
+#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE)  \
+  if (!KERNEL) {   \
+std::stringstream ss;  \
+ss << FUNCNAME << " not implemented for " << type->ToString(); \
+return Status::NotImplemented(ss.str());   \
+  }
+
+Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* 
out) {
+  auto hash_table = std::make_shared(pool);
+
+  RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size));
+  int32_t* slots = reinterpret_cast(hash_table->mutable_data());
+  std::fill(slots, slots + size, kHashSlotEmpty);
+
+  *out = hash_table;
+  return Status::OK();
+}
+
+// This is a slight design concession -- some hash actions have the possibility
+// of failure. Rather than introduce extra error checking into all actions, we
+// will raise an internal exception so that only the actions where errors can
+// occur will experience the extra overhead
+class HashException : public std::exception {
+ public:
+  explicit HashException(const std::string& msg, StatusCode code = 
StatusCode::Invalid)
+  : msg_(msg), code_(code) {}
+
+  ~HashException() throw() {}
+
+  const char* what() const throw() override;
+
+  StatusCode code() const { return code_; }
+
+ private:
+  std::string msg_;
+  StatusCode code_;
+};
+
+const char* HashException::what() const throw() { return msg_.c_str(); }
+
+class HashTable {
+ public:
+  HashTable(const std::shared_ptr& type, MemoryPool* pool)
+  : type_(type),
+pool_(pool),
+initialized_(false),
+hash_table_(nullptr),
+hash_slots_(nullptr),
+hash_table_size_(0),
+mod_bitmask_(0) {}
+
+  virtual ~HashTable() {}
+
+  virtual Status Append(const ArrayData& input) = 0;
+  virtual Status Flush(std::vector* out) = 0;
+  virtual Status GetDictionary(std::shared_ptr* out) = 0;
+
+ protected:
+  Status Init(int64_t elements);
+
+  std::shared_ptr type_;
+  MemoryPool* pool_;
+  bool initialized_;
+
+  // The hash table contains integer indices that reference the set of observed
+  // distinct values
+  std::shared_ptr hash_table_;
+  hash_slot_t* hash_slots_;
+
+  /// Size of the table. Must be a power of 2.
+  int64_t hash_table_size_;
+
+  // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j %
+  // hash_table_size_, but uses far fewer CPU cycles
+  int64_t mod_bitmask_;
+};
+
+Status HashTable::Init(int64_t elements) {
+  DCHECK_EQ(elements, BitUtil::NextPower2(elements));
+  RETURN_NOT_OK(NewHashTable(elements, pool_, _table_));
+  hash_slots_ = reinterpret_cast(hash_table_->mutable_data());
+  hash_table_size_ = elements;
+  mod_bitmask_ = elements - 1;
+  initialized_ = true;
+  return Status::OK();
+}
+
+template 
+class HashTableKernel : public HashTable {};
+
+// Types of hash 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252913#comment-16252913
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344480057
 
 
   Consider it done. The Python tests are segfaulting due to something it seems 
I broke in the cast code path, so I'll look into that tomorrow and also address 
the `BooleanType` hashing


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252909#comment-16252909
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151027415
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -92,7 +92,7 @@ static inline std::shared_ptr SliceData(const 
ArrayData& data, int64_
   auto new_data = data.ShallowCopy();
   new_data->length = length;
   new_data->offset = offset;
-  new_data->null_count = kUnknownNullCount;
+  new_data->null_count = data.null_count != 0 ? kUnknownNullCount : 0;
 
 Review comment:
   This is a helpful optimization -- if you know your null count is already 0, 
no need to keep thinking it's unknown


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252911#comment-16252911
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151027685
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/hash.cc
 ##
 @@ -0,0 +1,890 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/kernels/hash.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+#include "arrow/compute/kernels/util-internal.h"
+#include "arrow/util/hash-util.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+// Initially 1024 elements
+static constexpr int64_t kInitialHashTableSize = 1 << 10;
+
+typedef int32_t hash_slot_t;
+static constexpr hash_slot_t kHashSlotEmpty = 
std::numeric_limits::max();
+
+// The maximum load factor for the hash table before resizing.
+static constexpr double kMaxHashTableLoad = 0.7;
+
+enum class SIMDMode : char { NOSIMD, SSE4, AVX2 };
+
+#define CHECK_IMPLEMENTED(KERNEL, FUNCNAME, TYPE)  \
+  if (!KERNEL) {   \
+std::stringstream ss;  \
+ss << FUNCNAME << " not implemented for " << type->ToString(); \
+return Status::NotImplemented(ss.str());   \
+  }
+
+Status NewHashTable(int64_t size, MemoryPool* pool, std::shared_ptr* 
out) {
+  auto hash_table = std::make_shared(pool);
+
+  RETURN_NOT_OK(hash_table->Resize(sizeof(hash_slot_t) * size));
+  int32_t* slots = reinterpret_cast(hash_table->mutable_data());
+  std::fill(slots, slots + size, kHashSlotEmpty);
+
+  *out = hash_table;
+  return Status::OK();
+}
+
+// This is a slight design concession -- some hash actions have the possibility
+// of failure. Rather than introduce extra error checking into all actions, we
+// will raise an internal exception so that only the actions where errors can
+// occur will experience the extra overhead
+class HashException : public std::exception {
+ public:
+  explicit HashException(const std::string& msg, StatusCode code = 
StatusCode::Invalid)
+  : msg_(msg), code_(code) {}
+
+  ~HashException() throw() {}
+
+  const char* what() const throw() override;
+
+  StatusCode code() const { return code_; }
+
+ private:
+  std::string msg_;
+  StatusCode code_;
+};
+
+const char* HashException::what() const throw() { return msg_.c_str(); }
+
+// TODO(wesm): we do not need this macro yet
+
+// #define HASH_THROW_NOT_OK(EXPR) \
+//   do {  \
+// Status _s = (EXPR); \
+// if (ARROW_PREDICT_FALSE(!_s.ok())) {\
+//   throw HashException(_s.message(), _s.code()); \
+// }   \
+//   } while (false)
+
+class HashTable {
+ public:
+  HashTable(const std::shared_ptr& type, MemoryPool* pool)
+  : type_(type),
+pool_(pool),
+initialized_(false),
+hash_table_(nullptr),
+hash_slots_(nullptr),
+hash_table_size_(0),
+mod_bitmask_(0) {}
+
+  virtual ~HashTable() {}
+
+  virtual Status Append(const ArrayData& input) = 0;
+  virtual Status Flush(std::vector* out) = 0;
+  virtual Status GetDictionary(std::shared_ptr* out) = 0;
+
+ protected:
+  Status Init(int64_t elements);
+
+  std::shared_ptr type_;
+  MemoryPool* pool_;
+  bool initialized_;
+
+  // The hash table contains integer indices that reference the set of observed
+  // distinct values
+  std::shared_ptr hash_table_;
+  hash_slot_t* hash_slots_;
+
+  /// Size of the table. Must be a power of 2.
+  int64_t hash_table_size_;
+
+  // Store hash_table_size_ - 1, so that j & mod_bitmask_ is equivalent to j %
+  // hash_table_size_, but uses far fewer CPU cycles
+  int64_t mod_bitmask_;
+};
+
+Status 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252910#comment-16252910
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: ARROW-1559: [C++] Add Unique 
kernel and refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r151027570
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/cast.h
 ##
 @@ -60,7 +63,12 @@ Status Cast(FunctionContext* context, const Array& array,
 const std::shared_ptr& to_type, const CastOptions& 
options,
 std::shared_ptr* out);
 
+ARROW_EXPORT
+Status Cast(FunctionContext* context, const Datum& value,
+const std::shared_ptr& to_type, const CastOptions& 
options,
+Datum* out);
+
 }  // namespace compute
 }  // namespace arrow
 
-#endif  // ARROW_COMPUTE_CAST_H
+#endif  // ARROW_COMPUTE_KERNELS_CAST_H
 
 Review comment:
   I probably need to do a once over of Cast and make sure the ChunkedArray 
form also works


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252904#comment-16252904
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

pcmoritz commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and 
refactor DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344477903
 
 
   +1 on the mapbox variant


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252896#comment-16252896
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344476480
 
 
   OK, I'm finally ready for some code review on this.
   
   So we have one matter to resolve that is a bit annoying. I created a 
`arrow::compute::Datum` type that is based on `boost::variant`. Fundamentally, 
we need such a variant type in our compute kernels to be able to operate on 
different types of data: scalars, arrays, chunked arrays, etc. As an example, 
we would like to be able to add two `Datum` values together, each of which 
might be a scalar value (object model TBD) or an array.
   
   I suspect that we do not want to have boost in our public headers. So we 
have a decision to make here:
   
   * We could vendor an ASL 2.0-compatible header-only `boost::variant` 
replacement like https://github.com/mapbox/variant. We're lucky that one exists
   * We can refactor `Datum` to use a PIMPL. That seems pretty heavy-weight for 
such a tiny struct -- in such case even a stack allocation of `Datum` would 
require a heap allocation, which is not free when you have a lot of these 
things.
   
   I'm inclined to take the former approach, mapbox/variant is ~1200 lines of 
headers and seems to want to be a drop-in replacement for `boost::variant`, is 
BSD-3 licensed, so probably not the worst dependency to take on. 
   
   cc @xhochy @cpcloud 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252895#comment-16252895
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: ARROW-1559: [C++] Add Unique kernel and refactor 
DictionaryBuilder to be a stateful kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344476480
 
 
   OK, I'm finally ready for some code review on this.
   
   So we have one matter to resolve that is a bit annoying. I created a 
`arrow::compute::Datum` type that is based on `boost::variant`. Fundamentally, 
we need such a variant type in our compute kernels to be able to operate on 
different types of data: scalars, arrays, chunked arrays, etc. 
   
   I suspect that we do not want to have boost in our public headers. So we 
have a decision to make here:
   
   * We could vendor an ASL 2.0-compatible header-only `boost::variant` 
replacement like https://github.com/mapbox/variant. We're lucky that one exists
   * We can refactor `Datum` to use a PIMPL. That seems pretty heavy-weight for 
such a tiny struct -- in such case even a stack allocation of `Datum` would 
require a heap allocation, which is not free when you have a lot of these 
things.
   
   I'm inclined to take the former approach, mapbox/variant is ~1200 lines of 
headers and seems to want to be a drop-in replacement for `boost::variant`, is 
BSD-3 licensed, so probably not the worst dependency to take on. 
   
   cc @xhochy @cpcloud 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250843#comment-16250843
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-344145502
 
 
   I started pulling on threads and the whole thing unraveled. I needed to add 
a variant output type (`arrow::compute::Datum`) and I wanted to change unique 
and dictionary-encode to both be kernels. I'm close to having things all 
working again, I will spend the next day or so re-writing the unit tests to 
work with the new code structure, and then making sure we don't have any 
obvious regressions.
   
   As one annoying matter, when you dictionary encode a chunked array, you may 
not have seen all the unique values yet, so the integer type output may change 
as you observe more chunks. As a result, for the time being I think it is best 
if we dictionary encode everything to int32 instead of using the adaptive 
integer builder. If we want to optimize space to make things smaller we can 
revisit after this patch


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244149#comment-16244149
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-342852623
 
 
   I'm still working on this. There's quite a bit of work to do to refactor the 
`DictionaryBuilder` tests to be kernel-based, so with luck I'll have a patch up 
later today or tomorrow


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236124#comment-16236124
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-341489847
 
 
   @wesm I don't expect you to write the documents. This is just a general 
remark that we need those. As I'm probably more interested in API stability and 
moving forward with the casting, I should spend time on it. 
   
   As my time is also quite severely limited at the moment, I will have a look 
at writing more documentation. This is not really an easy task but I think 
necessary for us to get more community members on board.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236111#comment-16236111
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-341486675
 
 
   I would be happy to write some design documents, but in exchange I need some 
more help moving along routine development and maintenance of the project. The 
honest truth is that I simply haven't had enough time to really think through 
all of the details of what the kernel APIs need to look like -- for example, we 
need to define an object model to accommodate scalar values (for example, 
adding a scalar to an array, or casting a scalar from one type to another). The 
code for the cast kernels could be a lot cleaner than it is. I have a high 
tolerance for this kind of uncertainty and don't mind doing a lot of 
refactoring as we figure out the right general shape of the kernel-operator 
API. I've also been looking at libraries like Dremio and TensorFlow a bit for 
inspiration. 
   
   At a high level, the purpose of the kernels is to be able to perform 
analytics on chunked arrays. Depending on the operator, the kernels may need to 
be stateful (e.g. reductions, hash-table based analytics) or stateless 
(elementwise functions, NumPy ufunc-like math, etc.). Operators will need to be 
able to accommodate dispatch to different variants of kernels (e.g. 
SIMD-enabled, non-SIMD, GPU)
   
   I'd like to spend the majority of my time working solely on kernels and 
analytics, but I can't do that at the expense of all the other small things 
that would fall through the cracks otherwise. Out of the 0.8.0 milestone so 
far, I have resolved 64 JIRAs -- the rest of the Arrow community has done 67. 
So my effective burden of moving along the project only looking at JIRAs is 
around 50%. When you add release management and PR maintenance, the number goes 
above 50% for sure. 
   
   This is not a complaint, just pointing out with things as they are I am not 
sure I can do more than I'm already doing. I will be more than happy do more 
design and architecture work as soon as the community starts sharing more of 
the development workload. At the moment, to let the development work drop to 
write more documentation and design docs seems like an unacceptable compromise 
to me. Getting the project to a format stable 1.0.0 release and to make it 
suitable for production use for data interchange in Apache Spark and elsewhere 
is the most important thing for me right now, and engineering work in support 
of that is going to take priority over design docs
   
   I'm on a plane right now so I'm going to hack on these hash kernels for 
several hours and see how far I can get


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234464#comment-16234464
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-341184022
 
 
   At the moment, I also tend to step a bit back and first have a look at this 
again in a design document. There are several issues where I have no clear 
opinion yet but that would probably require some thinking:
   
* Do we need kernel call methods for each level of 
Array/ChunkedArray/Column? Having them instead of a generic `InvokeUnary` on 
each of the three data structures might lead to a lot of code duplication or 
simple pass-through functions. Otherwise having an `InvokeUnary` method would 
prohibit us from doing some optimizations in the case that we pass over several 
arrays in a column and could do some operations only once.
* My use case here is to selective categorical conversion, my initial 
approach was to implement `unique(column)` and then use this to create a 
`DictionaryType` instance that would then be fed to all underlying arrays to 
make the categorical conversion. This might not be the best solution as the 
`DictionaryType` instance doesn't contain the hash map anymore and would have 
to reconstruct it.
   
   Also, do we in general have a design document for the kernels? We need to 
think about state, parallelisation, .. in general. I might have missed this but 
I think having it integrated into the Arrow documentation will ease entry for 
future contributors (and myself).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234450#comment-16234450
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

xhochy commented on a change in pull request #1266: WIP: ARROW-1559: Add unique 
kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r148330119
 
 

 ##
 File path: cpp/src/arrow/compute/hash_kernels.cc
 ##
 @@ -0,0 +1,131 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/hash_kernels.h"
+
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+template 
+class UniqueKernel : public UnaryKernel {
+  Status Call(FunctionContext* ctx, const Array& input,
+  std::shared_ptr* out) override {
+if (!unique_builder) {
+  unique_builder =
+  std::make_shared(input.type(), 
ctx->memory_pool());
+}
+
+RETURN_NOT_OK(unique_builder->AppendArray(input));
+
+if (out) {
+  RETURN_NOT_OK(unique_builder->FinishInternal(out));
+  unique_builder.reset();
+}
+
+return Status::OK();
+  }
+
+ private:
+  std::shared_ptr unique_builder;
+};
+}
+
+#define UNIQUE_FUNCTION_CASE(InType)\
+  case InType::type_id: \
+*kernel = std::unique_ptr(new UniqueKernel()); \
+break
+
+Status GetUniqueFunction(const DataType& in_type, 
std::unique_ptr* kernel) {
+  switch (in_type.id()) {
+// UNIQUE_FUNCTION_CASE(NullType);
+// UNIQUE_FUNCTION_CASE(BooleanType);
+UNIQUE_FUNCTION_CASE(UInt8Type);
+UNIQUE_FUNCTION_CASE(Int8Type);
+UNIQUE_FUNCTION_CASE(UInt16Type);
+UNIQUE_FUNCTION_CASE(Int16Type);
+UNIQUE_FUNCTION_CASE(UInt32Type);
+UNIQUE_FUNCTION_CASE(Int32Type);
+UNIQUE_FUNCTION_CASE(UInt64Type);
+UNIQUE_FUNCTION_CASE(Int64Type);
+UNIQUE_FUNCTION_CASE(FloatType);
+UNIQUE_FUNCTION_CASE(DoubleType);
+UNIQUE_FUNCTION_CASE(Date32Type);
+UNIQUE_FUNCTION_CASE(Date64Type);
+UNIQUE_FUNCTION_CASE(Time32Type);
+UNIQUE_FUNCTION_CASE(Time64Type);
+UNIQUE_FUNCTION_CASE(TimestampType);
+UNIQUE_FUNCTION_CASE(BinaryType);
+UNIQUE_FUNCTION_CASE(StringType);
+UNIQUE_FUNCTION_CASE(FixedSizeBinaryType);
+default:
+  break;
+  }
+
+  if (*kernel == nullptr) {
+std::stringstream ss;
+ss << "No unique implemented for " << in_type.ToString();
+return Status::NotImplemented(ss.str());
+  }
+
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* ctx, const Array& array, 
std::shared_ptr* out) {
+  // Dynamic dispatch to obtain right cast function
+  std::unique_ptr func;
+  RETURN_NOT_OK(GetUniqueFunction(*array.type(), ));
+
+  std::shared_ptr out_data;
+  RETURN_NOT_OK(func->Call(ctx, array, _data));
+  *out = MakeArray(out_data);
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* ctx, const ChunkedArray& array,
+  std::shared_ptr* out) {
+  // Dynamic dispatch to obtain right cast function
+  std::unique_ptr func;
+  RETURN_NOT_OK(GetUniqueFunction(*array.type(), ));
+
+  // Call the kernel without out_data on all but the last chunk
+  for (int i = 0; i < (array.num_chunks() - 1); i++) {
+RETURN_NOT_OK(func->Call(ctx, *array.chunk(i), nullptr));
+  }
+
+  std::shared_ptr out_data;
+  // The array has a large chunk, call the kernel and retrieve the result.
+  RETURN_NOT_OK(func->Call(ctx, *array.chunk(array.num_chunks() - 1), 
_data));
+  *out = MakeArray(out_data);
+
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* context, const Column& array,
+  std::shared_ptr* out) {
+  return Unique(context, *array.data(), out);
+}
 
 Review comment:
   This also sounds fine to me. In the end, we would need to have these three 
levels for all kernels (Array, ChunkedArray, Column) which would mean a lot of 
pass-through methods.


This is an automated message from the Apache Git Service.
To respond to the message, please log on 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-11-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234449#comment-16234449
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-341180952
 
 
   Feel free to push here, I'm open for the suggested changes. This is probably 
going to be a bit bigger as it lays some more foundations for the compute 
library.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226988#comment-16226988
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-340800603
 
 
   Do you mind if I take some time to work on this today/tomorrow? I can push 
to this branch so we can collaborate here


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225663#comment-16225663
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique 
kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r147820948
 
 

 ##
 File path: cpp/src/arrow/builder.cc
 ##
 @@ -1110,20 +1034,202 @@ Status DictionaryBuilder::AppendDictionary(const 
Scalar& value) {
   }
 \

 \
   template <>  
 \
-  int DictionaryBuilder::HashValue(const WrappedBinary& value) { 
 \
+  WrappedBinary UniqueBuilder::GetDictionaryValue(int64_t index) {   
 \
+int32_t v_len; 
 \
+const uint8_t* v = dict_builder_.GetValue(static_cast(index), 
_len); \
+return WrappedBinary(v, v_len);
 \
+  }
 \
+   
 \
+  template <>  
 \
+  Status UniqueBuilder::AppendDictionary(const WrappedBinary& value) {   
 \
+return dict_builder_.Append(value.ptr_, value.length_);
 \
+  }
 \
+   
 \
+  template <>  
 \
+  int UniqueBuilder::HashValue(const WrappedBinary& value) { 
 \
 return HashUtil::Hash(value.ptr_, value.length_, 0);   
 \
   }
 \

 \
   template <>  
 \
-  bool DictionaryBuilder::SlotDifferent(hash_slot_t index,   
 \
-  const WrappedBinary& value) {
 \
+  bool UniqueBuilder::SlotDifferent(hash_slot_t index,   
 \
+  const WrappedBinary& value) {
 \
 int32_t other_length;  
 \
 const uint8_t* other_value =   
 \
 dict_builder_.GetValue(static_cast(index), _length);
 \
 return !(other_length == value.length_ &&  
 \
  0 == memcmp(other_value, value.ptr_, value.length_)); 
 \
   }
 
+BINARY_UNIQUE_SPECIALIZATIONS(StringType);
+BINARY_UNIQUE_SPECIALIZATIONS(BinaryType);
+
+template 
+Status UniqueBuilder::FinishInternal(std::shared_ptr* out) {
+  return dict_builder_.FinishInternal(out);
+}
+
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+
+// --
+// DictionaryBuilder
+
+template 
+DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type,
+MemoryPool* pool)
+: ArrayBuilder(type, pool), unique_builder_(type, pool), 
values_builder_(pool) {}
+
+DictionaryBuilder::DictionaryBuilder(const 
std::shared_ptr& type,
+   MemoryPool* pool)
+: ArrayBuilder(type, pool), values_builder_(pool) {}
+
+DictionaryBuilder::~DictionaryBuilder() {}
+
+template <>
+DictionaryBuilder::DictionaryBuilder(
+const std::shared_ptr& type, MemoryPool* pool)
+: ArrayBuilder(type, pool), unique_builder_(type, pool), 
values_builder_(pool) {}
+
+template 
+Status DictionaryBuilder::Init(int64_t elements) {
+  RETURN_NOT_OK(ArrayBuilder::Init(elements));
+  RETURN_NOT_OK(unique_builder_.Init(elements));
+  return values_builder_.Init(elements);
+}
+
+Status DictionaryBuilder::Init(int64_t elements) {
+  RETURN_NOT_OK(ArrayBuilder::Init(elements));
+  return 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225497#comment-16225497
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique 
kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r147792444
 
 

 ##
 File path: cpp/src/arrow/compute/hash_kernels.cc
 ##
 @@ -0,0 +1,131 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/hash_kernels.h"
+
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+template 
+class UniqueKernel : public UnaryKernel {
+  Status Call(FunctionContext* ctx, const Array& input,
+  std::shared_ptr* out) override {
+if (!unique_builder) {
+  unique_builder =
+  std::make_shared(input.type(), 
ctx->memory_pool());
+}
+
+RETURN_NOT_OK(unique_builder->AppendArray(input));
+
+if (out) {
+  RETURN_NOT_OK(unique_builder->FinishInternal(out));
+  unique_builder.reset();
+}
+
+return Status::OK();
+  }
+
+ private:
+  std::shared_ptr unique_builder;
+};
+}
+
+#define UNIQUE_FUNCTION_CASE(InType)\
+  case InType::type_id: \
+*kernel = std::unique_ptr(new UniqueKernel()); \
+break
+
+Status GetUniqueFunction(const DataType& in_type, 
std::unique_ptr* kernel) {
+  switch (in_type.id()) {
+// UNIQUE_FUNCTION_CASE(NullType);
+// UNIQUE_FUNCTION_CASE(BooleanType);
+UNIQUE_FUNCTION_CASE(UInt8Type);
+UNIQUE_FUNCTION_CASE(Int8Type);
+UNIQUE_FUNCTION_CASE(UInt16Type);
+UNIQUE_FUNCTION_CASE(Int16Type);
+UNIQUE_FUNCTION_CASE(UInt32Type);
+UNIQUE_FUNCTION_CASE(Int32Type);
+UNIQUE_FUNCTION_CASE(UInt64Type);
+UNIQUE_FUNCTION_CASE(Int64Type);
+UNIQUE_FUNCTION_CASE(FloatType);
+UNIQUE_FUNCTION_CASE(DoubleType);
+UNIQUE_FUNCTION_CASE(Date32Type);
+UNIQUE_FUNCTION_CASE(Date64Type);
+UNIQUE_FUNCTION_CASE(Time32Type);
+UNIQUE_FUNCTION_CASE(Time64Type);
+UNIQUE_FUNCTION_CASE(TimestampType);
+UNIQUE_FUNCTION_CASE(BinaryType);
+UNIQUE_FUNCTION_CASE(StringType);
+UNIQUE_FUNCTION_CASE(FixedSizeBinaryType);
+default:
+  break;
+  }
+
+  if (*kernel == nullptr) {
+std::stringstream ss;
+ss << "No unique implemented for " << in_type.ToString();
+return Status::NotImplemented(ss.str());
+  }
+
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* ctx, const Array& array, 
std::shared_ptr* out) {
+  // Dynamic dispatch to obtain right cast function
+  std::unique_ptr func;
+  RETURN_NOT_OK(GetUniqueFunction(*array.type(), ));
+
+  std::shared_ptr out_data;
+  RETURN_NOT_OK(func->Call(ctx, array, _data));
+  *out = MakeArray(out_data);
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* ctx, const ChunkedArray& array,
+  std::shared_ptr* out) {
+  // Dynamic dispatch to obtain right cast function
+  std::unique_ptr func;
+  RETURN_NOT_OK(GetUniqueFunction(*array.type(), ));
+
+  // Call the kernel without out_data on all but the last chunk
+  for (int i = 0; i < (array.num_chunks() - 1); i++) {
+RETURN_NOT_OK(func->Call(ctx, *array.chunk(i), nullptr));
+  }
+
+  std::shared_ptr out_data;
+  // The array has a large chunk, call the kernel and retrieve the result.
+  RETURN_NOT_OK(func->Call(ctx, *array.chunk(array.num_chunks() - 1), 
_data));
+  *out = MakeArray(out_data);
+
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* context, const Column& array,
+  std::shared_ptr* out) {
+  return Unique(context, *array.data(), out);
+}
 
 Review comment:
   I'm not sure about having `Unique` wrappers for ChunkedArray and Column. 
What do we think about implementing generic `InvokeUnary` methods that accept a 
generic `UnaryKernel` that applies to these data structures?


This is an automated message from the Apache Git Service.
To 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225492#comment-16225492
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique 
kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r147787872
 
 

 ##
 File path: cpp/src/arrow/CMakeLists.txt
 ##
 @@ -52,6 +52,7 @@ if (ARROW_COMPUTE)
   set(ARROW_SRCS ${ARROW_SRCS}
 compute/cast.cc
 compute/context.cc
+compute/hash_kernels.cc
 
 Review comment:
   In a subsequent PR we should create the `src/arrow/compute/kernels` 
directory for cleaner organization of code


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225491#comment-16225491
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique 
kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r147791018
 
 

 ##
 File path: cpp/src/arrow/builder.cc
 ##
 @@ -1110,20 +1034,202 @@ Status DictionaryBuilder::AppendDictionary(const 
Scalar& value) {
   }
 \

 \
   template <>  
 \
-  int DictionaryBuilder::HashValue(const WrappedBinary& value) { 
 \
+  WrappedBinary UniqueBuilder::GetDictionaryValue(int64_t index) {   
 \
+int32_t v_len; 
 \
+const uint8_t* v = dict_builder_.GetValue(static_cast(index), 
_len); \
+return WrappedBinary(v, v_len);
 \
+  }
 \
+   
 \
+  template <>  
 \
+  Status UniqueBuilder::AppendDictionary(const WrappedBinary& value) {   
 \
+return dict_builder_.Append(value.ptr_, value.length_);
 \
+  }
 \
+   
 \
+  template <>  
 \
+  int UniqueBuilder::HashValue(const WrappedBinary& value) { 
 \
 return HashUtil::Hash(value.ptr_, value.length_, 0);   
 \
   }
 \

 \
   template <>  
 \
-  bool DictionaryBuilder::SlotDifferent(hash_slot_t index,   
 \
-  const WrappedBinary& value) {
 \
+  bool UniqueBuilder::SlotDifferent(hash_slot_t index,   
 \
+  const WrappedBinary& value) {
 \
 int32_t other_length;  
 \
 const uint8_t* other_value =   
 \
 dict_builder_.GetValue(static_cast(index), _length);
 \
 return !(other_length == value.length_ &&  
 \
  0 == memcmp(other_value, value.ptr_, value.length_)); 
 \
   }
 
+BINARY_UNIQUE_SPECIALIZATIONS(StringType);
+BINARY_UNIQUE_SPECIALIZATIONS(BinaryType);
+
+template 
+Status UniqueBuilder::FinishInternal(std::shared_ptr* out) {
+  return dict_builder_.FinishInternal(out);
+}
+
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+template class UniqueBuilder;
+
+// --
+// DictionaryBuilder
+
+template 
+DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type,
+MemoryPool* pool)
+: ArrayBuilder(type, pool), unique_builder_(type, pool), 
values_builder_(pool) {}
+
+DictionaryBuilder::DictionaryBuilder(const 
std::shared_ptr& type,
+   MemoryPool* pool)
+: ArrayBuilder(type, pool), values_builder_(pool) {}
+
+DictionaryBuilder::~DictionaryBuilder() {}
+
+template <>
+DictionaryBuilder::DictionaryBuilder(
+const std::shared_ptr& type, MemoryPool* pool)
+: ArrayBuilder(type, pool), unique_builder_(type, pool), 
values_builder_(pool) {}
+
+template 
+Status DictionaryBuilder::Init(int64_t elements) {
+  RETURN_NOT_OK(ArrayBuilder::Init(elements));
+  RETURN_NOT_OK(unique_builder_.Init(elements));
+  return values_builder_.Init(elements);
+}
+
+Status DictionaryBuilder::Init(int64_t elements) {
+  RETURN_NOT_OK(ArrayBuilder::Init(elements));
+  return 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225493#comment-16225493
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique 
kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r147788617
 
 

 ##
 File path: cpp/src/arrow/builder.cc
 ##
 @@ -815,48 +815,35 @@ Status BooleanBuilder::Append(const std::vector& 
values) {
 }
 
 // --
-// DictionaryBuilder
+// UniqueBuilder
 
 template 
-DictionaryBuilder::DictionaryBuilder(const std::shared_ptr& type,
-MemoryPool* pool)
+UniqueBuilder::UniqueBuilder(const std::shared_ptr& type, 
MemoryPool* pool)
 : ArrayBuilder(type, pool),
 
 Review comment:
   We may reach a point where it doesn't make sense to inherit from 
ArrayBuilder, and rather compose it (or not use it at all)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225494#comment-16225494
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on a change in pull request #1266: WIP: ARROW-1559: Add unique 
kernel
URL: https://github.com/apache/arrow/pull/1266#discussion_r147792444
 
 

 ##
 File path: cpp/src/arrow/compute/hash_kernels.cc
 ##
 @@ -0,0 +1,131 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/compute/hash_kernels.h"
+
+#include 
+
+#include "arrow/builder.h"
+#include "arrow/compute/context.h"
+#include "arrow/compute/kernel.h"
+
+namespace arrow {
+namespace compute {
+
+namespace {
+
+template 
+class UniqueKernel : public UnaryKernel {
+  Status Call(FunctionContext* ctx, const Array& input,
+  std::shared_ptr* out) override {
+if (!unique_builder) {
+  unique_builder =
+  std::make_shared(input.type(), 
ctx->memory_pool());
+}
+
+RETURN_NOT_OK(unique_builder->AppendArray(input));
+
+if (out) {
+  RETURN_NOT_OK(unique_builder->FinishInternal(out));
+  unique_builder.reset();
+}
+
+return Status::OK();
+  }
+
+ private:
+  std::shared_ptr unique_builder;
+};
+}
+
+#define UNIQUE_FUNCTION_CASE(InType)\
+  case InType::type_id: \
+*kernel = std::unique_ptr(new UniqueKernel()); \
+break
+
+Status GetUniqueFunction(const DataType& in_type, 
std::unique_ptr* kernel) {
+  switch (in_type.id()) {
+// UNIQUE_FUNCTION_CASE(NullType);
+// UNIQUE_FUNCTION_CASE(BooleanType);
+UNIQUE_FUNCTION_CASE(UInt8Type);
+UNIQUE_FUNCTION_CASE(Int8Type);
+UNIQUE_FUNCTION_CASE(UInt16Type);
+UNIQUE_FUNCTION_CASE(Int16Type);
+UNIQUE_FUNCTION_CASE(UInt32Type);
+UNIQUE_FUNCTION_CASE(Int32Type);
+UNIQUE_FUNCTION_CASE(UInt64Type);
+UNIQUE_FUNCTION_CASE(Int64Type);
+UNIQUE_FUNCTION_CASE(FloatType);
+UNIQUE_FUNCTION_CASE(DoubleType);
+UNIQUE_FUNCTION_CASE(Date32Type);
+UNIQUE_FUNCTION_CASE(Date64Type);
+UNIQUE_FUNCTION_CASE(Time32Type);
+UNIQUE_FUNCTION_CASE(Time64Type);
+UNIQUE_FUNCTION_CASE(TimestampType);
+UNIQUE_FUNCTION_CASE(BinaryType);
+UNIQUE_FUNCTION_CASE(StringType);
+UNIQUE_FUNCTION_CASE(FixedSizeBinaryType);
+default:
+  break;
+  }
+
+  if (*kernel == nullptr) {
+std::stringstream ss;
+ss << "No unique implemented for " << in_type.ToString();
+return Status::NotImplemented(ss.str());
+  }
+
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* ctx, const Array& array, 
std::shared_ptr* out) {
+  // Dynamic dispatch to obtain right cast function
+  std::unique_ptr func;
+  RETURN_NOT_OK(GetUniqueFunction(*array.type(), ));
+
+  std::shared_ptr out_data;
+  RETURN_NOT_OK(func->Call(ctx, array, _data));
+  *out = MakeArray(out_data);
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* ctx, const ChunkedArray& array,
+  std::shared_ptr* out) {
+  // Dynamic dispatch to obtain right cast function
+  std::unique_ptr func;
+  RETURN_NOT_OK(GetUniqueFunction(*array.type(), ));
+
+  // Call the kernel without out_data on all but the last chunk
+  for (int i = 0; i < (array.num_chunks() - 1); i++) {
+RETURN_NOT_OK(func->Call(ctx, *array.chunk(i), nullptr));
+  }
+
+  std::shared_ptr out_data;
+  // The array has a large chunk, call the kernel and retrieve the result.
+  RETURN_NOT_OK(func->Call(ctx, *array.chunk(array.num_chunks() - 1), 
_data));
+  *out = MakeArray(out_data);
+
+  return Status::OK();
+}
+
+Status Unique(FunctionContext* context, const Column& array,
+  std::shared_ptr* out) {
+  return Unique(context, *array.data(), out);
+}
 
 Review comment:
   I'm not sure about having `Unique` wrappers for ChunkedArray and Column. 
What do we think about implementing generic `UnaryInvoke` methods that accept a 
generic `UnaryKernel` that applies to these data structures?


This is an automated message from the Apache Git Service.
To 

[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224260#comment-16224260
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-340315454
 
 
   Let me spend some time looking at this and I'll let you know my thoughts, 
worst case by EOD tomorrow (Monday)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223776#comment-16223776
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-340226336
 
 
   @wesm I'm struggling a bit here on how to continue. The next step would be 
to implement `X->Dictionary` casts for columns. This can get a bit tricky as we 
first need to fill a `UniqueBuilder` and then go over all chunks and build the 
resulting arrays. This then sadly means that we have a single stateful cast 
kernel instance.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223771#comment-16223771
 ] 

ASF GitHub Bot commented on ARROW-1559:
---

xhochy opened a new pull request #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266
 
 
   Only intended to implement selective categorical conversion in `to_pandas()` 
but it seems that there is a lot missing to do this in a clean fashion.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>  Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1559) [C++] Kernel implementations for "unique" (compute distinct elements of array)

2017-10-26 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220896#comment-16220896
 ] 

Wes McKinney commented on ARROW-1559:
-

All hash table related functions will need an option to address how nulls are 
handled. For example, null could be included or excluded from the unique values 
depending on the needs of the application (sometimes null is data, other times 
it is to be ignored). 

> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> --
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>  Labels: Analytics
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)