[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370337#comment-16370337
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm closed pull request #1551: ARROW-1942: [C++] Hash table specializations 
for small integers
URL: https://github.com/apache/arrow/pull/1551
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/compute/compute-benchmark.cc 
b/cpp/src/arrow/compute/compute-benchmark.cc
index 44df44139..646010550 100644
--- a/cpp/src/arrow/compute/compute-benchmark.cc
+++ b/cpp/src/arrow/compute/compute-benchmark.cc
@@ -94,7 +94,7 @@ struct HashParams {
 std::vector is_valid;
 test::randint(length, 0, num_unique, );
 for (int64_t draw : draws) {
-  values.push_back(draw);
+  values.push_back(static_cast(draw));
 }
 
 if (this->null_percent > 0) {
@@ -170,6 +170,14 @@ void BenchDictionaryEncode(benchmark::State& state, const 
ParamType& params,
   state.SetBytesProcessed(state.iterations() * 
params.GetBytesProcessed(length));
 }
 
+static void BM_UniqueUInt8NoNulls(benchmark::State& state) {
+  BenchUnique(state, HashParams{0}, state.range(0), state.range(1));
+}
+
+static void BM_UniqueUInt8WithNulls(benchmark::State& state) {
+  BenchUnique(state, HashParams{0.05}, state.range(0), 
state.range(1));
+}
+
 static void BM_UniqueInt64NoNulls(benchmark::State& state) {
   BenchUnique(state, HashParams{0}, state.range(0), state.range(1));
 }
@@ -207,5 +215,17 @@ ADD_HASH_ARGS(BENCHMARK(BM_UniqueInt64WithNulls));
 ADD_HASH_ARGS(BENCHMARK(BM_UniqueString10bytes));
 ADD_HASH_ARGS(BENCHMARK(BM_UniqueString100bytes));
 
+BENCHMARK(BM_UniqueUInt8NoNulls)
+->Args({kHashBenchmarkLength, 200})
+->MinTime(1.0)
+->Unit(benchmark::kMicrosecond)
+->UseRealTime();
+
+BENCHMARK(BM_UniqueUInt8WithNulls)
+->Args({kHashBenchmarkLength, 200})
+->MinTime(1.0)
+->Unit(benchmark::kMicrosecond)
+->UseRealTime();
+
 }  // namespace compute
 }  // namespace arrow
diff --git a/cpp/src/arrow/compute/kernels/hash.cc 
b/cpp/src/arrow/compute/kernels/hash.cc
index da9797f77..dbce6e561 100644
--- a/cpp/src/arrow/compute/kernels/hash.cc
+++ b/cpp/src/arrow/compute/kernels/hash.cc
@@ -221,7 +221,10 @@ struct HashDictionary {
   }
 
 template 
-class HashTableKernel : public 
HashTable {
+class HashTableKernel<
+Type, Action,
+typename std::enable_if::type>
+: public HashTable {
  public:
   using T = typename Type::c_type;
 
@@ -611,6 +614,80 @@ class HashTableKernel
   int32_t dict_size_;
 };
 
+// --
+// Hash table pass for uint8 and int8
+
+template 
+inline int Hash8Bit(const T val) {
+  return 0;
+}
+
+template <>
+inline int Hash8Bit(const uint8_t val) {
+  return val;
+}
+
+template <>
+inline int Hash8Bit(const int8_t val) {
+  return val + 128;
+}
+
+template 
+class HashTableKernel : public 
HashTable {
+ public:
+  using T = typename Type::c_type;
+
+  HashTableKernel(const std::shared_ptr& type, MemoryPool* pool)
+  : HashTable(type, pool) {
+std::fill(table_, table_ + 256, kHashSlotEmpty);
+  }
+
+  Status Append(const ArrayData& arr) override {
+const T* values = GetValues(arr, 1);
+auto action = static_cast(this);
+RETURN_NOT_OK(action->Reserve(arr.length));
+
+#define HASH_INNER_LOOP()  \
+  const T value = values[i];   \
+  const int hash = Hash8Bit(value); \
+  hash_slot_t slot = table_[hash]; \
+   \
+  if (slot == kHashSlotEmpty) {\
+if (!Action::allow_expand) {   \
+  throw HashException("Encountered new dictionary value"); \
+}  \
+   \
+slot = static_cast(dict_.size()); \
+table_[hash] = slot;   \
+dict_.push_back(value);\
+action->ObserveNotFound(slot); \
+  } else { \
+action->ObserveFound(slot);\
+  }
+
+GENERIC_HASH_PASS(HASH_INNER_LOOP);
+
+#undef HASH_INNER_LOOP
+
+return Status::OK();
+  }
+
+  Status 

[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370336#comment-16370336
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-367057568
 
 
   Appveyor builds looking fine here: 
https://ci.appveyor.com/project/xuepanchen/arrow/build/1.0.25. Merging, thank 
you!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369669#comment-16369669
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366853886
 
 
   @xuepanchen I added a template for the 8 bit hash function to avoid 
arithmetic in the uint8 case
   
   before this change:
   
   ```
   $ ./release/compute-benchmark --benchmark_filter=UInt8
   Run on (8 X 4399.69 MHz CPU s)
   2018-02-19 21:57:13
   Benchmark Time   
CPU Iterations
   
---
   BM_UniqueUInt8NoNulls/16M/200/min_time:1.000/real_time 8339 us   
8339 us166   1.87372GB/s
   BM_UniqueUInt8WithNulls/16M/200/min_time:1.000/real_time  28536 us  
28537 us 49 560.7MB/s
   ```
   
   after this change:
   
   ```
   $ ./release/compute-benchmark --benchmark_filter=UInt8
   Run on (8 X 4400 MHz CPU s)
   2018-02-19 21:55:51
   Benchmark Time   
CPU Iterations
   
---
   BM_UniqueUInt8NoNulls/16M/200/min_time:1.000/real_time 7749 us   
7749 us180   2.01641GB/s
   BM_UniqueUInt8WithNulls/16M/200/min_time:1.000/real_time  28042 us  
28042 us 50   570.571MB/s
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369668#comment-16369668
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366853563
 
 
   @jreback I thought so, thanks for confirming =)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369667#comment-16369667
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

jreback commented on issue #1551: ARROW-1942: [C++] Hash table specializations 
for small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366853421
 
 
   fyi in pandas we currently promote to int64 for smaller item size for hash 
operations 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369666#comment-16369666
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366852383
 
 
   Top level numbers OK to me:
   
   ```
   In [1]: import numpy as np
   
   In [2]: arr = np.random.randint(0, 200, size=1000)
   
   In [3]: import pyarrow as pa
   
   In [4]: pa
   Out[4]: 
   
   In [5]: parr = pa.array(arr)
   
   In [9]: timeit result = parr.unique()
   33.5 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [10]: import pandas as pd
   
   In [11]: timeit result2 = pd.unique(arr)
   25.7 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [12]: timeit result2 = np.unique(arr)
   296 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   In [13]: parr_int8 = pa.array(arr.astype('int8'))
   
   In [14]: arr_int8 = arr.astype('int8')
   
   In [15]: timeit result = parr_int8.unique()
   10.1 ms ± 99.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   
   In [16]: timeit result = pd.unique(arr_int8)
   35.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [17]: timeit result = np.unique(arr_int8)
   282 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
   
   So we're about 30% slower than pandas for int64 at the moment (for this 
limited benchmark at least), which suggests plenty of room for improvement.
   
   Everything else looks good. +1, will merge on green build. Thanks 
@xuepanchen!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369651#comment-16369651
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366849040
 
 
   I squashed the branch. Having a look at compute-benchmark


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358798#comment-16358798
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

xuepanchen commented on issue #1551: ARROW-1942: [C++] Hash table 
specializations for small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-364520515
 
 
   @wesm got you. Compute-benchmark test failed when I run it locally and I 
can't figure out why. It seems to fail too before I added the 8-bit integer case


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358736#comment-16358736
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-364501814
 
 
   @xuepanchen it looks like something went haywire in the merge (but it is OK, 
no need to change anything) -- in the future you can do `git pull --ff-only` or 
`git pull --rebase` when working on the branch


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357447#comment-16357447
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-364220281
 
 
   @xuepanchen I made the functor changes. Can you add a benchmark for the 
8-bit integer case? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355914#comment-16355914
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

xuepanchen commented on a change in pull request #1551: ARROW-1942: [C++] Hash 
table specializations for small integers
URL: https://github.com/apache/arrow/pull/1551#discussion_r166725654
 
 

 ##
 File path: cpp/src/arrow/compute/kernels/util-internal.h
 ##
 @@ -47,11 +52,11 @@ using enable_if_timestamp =
 typename std::enable_if::value>::type;
 
 template 
-using enable_if_has_c_type =
-typename std::enable_if::value ||
-std::is_base_of::value ||
-std::is_base_of::value ||
-std::is_base_of::value>::type;
+using enable_if_has_c_type = typename std::enable_if<
+!std::is_same::value && !std::is_same::value &&
 
 Review comment:
   I am not sure how to exclude 8-bit integers in the functor declarations. I 
am thinking creating a new template function like 
"enable_if_has_c_type_excluding_8bit_integer" here and use it in the 
declaration of hash table pass for primitive types


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)