[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-10-15 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-148306575 we can create a hash expression, and codegen that. And then just use hyperloglog(hash(field)). --- If your project is set up for it, you can reply to this email and have y

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-10-15 Thread hvanhovell
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-148305817 Another thought on hashing. The ClearSpring hash is a generic hash function. We could used very specialized (hopefully fast) hashing functions, because we know the ty

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-10-14 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-148227211 Thanks for the pointer. Looks like we only have 32-bit Murmur3 in spark's unsafe module (https://github.com/apache/spark/blob/master/unsafe/src/main/java/org/apache/spark/

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-10-14 Thread hvanhovell
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-148209543 A good article on HLL++ and the hashcode: http://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll --- If your project is set up for it, yo

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-10-14 Thread hvanhovell
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-148209375 @yhuai It doesn't. A 64-bit hashcode is recommended though, especially when would want to approximate a billion or more unique values. I have used the ClearSpring has

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-10-14 Thread yhuai
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8362#discussion_r42054243 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala --- @@ -302,3 +307,397 @@ case class Sum(child: Expre

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8362 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enab

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-144479867 LGTM, merging this into master, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project do

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-144479889 mer --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-11597 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-11434 [Test build #43132 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43132/console) for PR 8362 at commit [`a5fdd07`](https://github.

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-11593 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-144391587 [Test build #43132 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43132/consoleFull) for PR 8362 at commit [`a5fdd07`](https://gith

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-144390880 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not h

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-144390907 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-30 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/8362#discussion_r40782166 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala --- @@ -0,0 +1,125 @@ +/*

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-29 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-144136673 We can work on improving the aggregate operator. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your pro

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-29 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-144118463 @hvanhovell @rxin Just realized that the tungsten aggregation does not support var-length types in aggregation buffer, so we can't have sparse version without aggregation

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-29 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/8362#discussion_r40696819 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala --- @@ -302,3 +307,393 @@ case class Sum(child: Expr

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-29 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/8362#discussion_r40697020 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala --- @@ -0,0 +1,125 @@ +/* + *

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-28 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-143945761 @yhuai can we make non-codegen path use tungsten aggregate as well? Otherwise we would need to maintain two entirely separate codepath. --- If your project is set up for i

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-28 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-143902631 I took a round, this looks pretty good to me over all. Currently, each grouping key needs 200 bytes (b=8, by default), so the sparse version could help to reduce

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-28 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/8362#discussion_r40610794 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala --- @@ -302,3 +307,393 @@ case class Sum(child: Expr

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-28 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/8362#discussion_r40610006 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala --- @@ -302,3 +307,393 @@ case class Sum(child: Expr

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-22 Thread hvanhovell
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-142323666 @MLnick I am in the process of moving house, so I am a bit slow/late with my response :(... I think it is very usefull to be able to return the HLL registers

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-14 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-140141999 @MLnick This one will replace the existing implementation. For now, we will do conversion as shown at https://github.com/apache/spark/pull/8362/files#diff-78b9b210b8cee72e

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-14 Thread MLnick
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-140019691 @hvanhovell @rxin is it intended that this replace the existing `approxCountDistinct ` implementation? And I assume this will happen automatically due to extending `Aggre

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-14 Thread MLnick
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-140019207 @hvanhovell as discussed on the `dev` mailing list, perhaps it would be interesting to allow the return type to include the aggregated HLL registers. This could be (for e

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-10 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-139406671 One quick note: https://github.com/twitter/algebird/pull/491/files anything we can learn from the above pr? --- If your project is set up for it, you can reply to

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-08 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-138699022 Ah ok. will add this to our sprint backlog and get somebody to review it soon. --- If your project is set up for it, you can reply to this email and have your reply appea

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-08 Thread hvanhovell
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-138493630 @rxin the dense version of HLL++ is ready. We could also add this, and add the sparse logic in a follow-up PR. Let me know what you think. I'll close if you'd rather

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-09-07 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-138384295 @hvanhovell do you mind closing this pull request, and re-open when you feel it is ready for review again? --- If your project is set up for it, you can reply to this emai

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-135639349 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-135639351 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-27 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-135639256 [Test build #41719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41719/console) for PR 8362 at commit [`1ea722b`](https://github.

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-27 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-135619890 [Test build #41719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41719/consoleFull) for PR 8362 at commit [`1ea722b`](https://gith

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-27 Thread hvanhovell
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-135619245 Implemented initial non-sparse HLL++. I am going to take a look at the sparse version next week. The results are still equal to the Clearspring HLL+ implementation in

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-135617993 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-135617983 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not h

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133567281 Thanks - I think blackbox testing is fine. But it would be great to apply that at the "unit" testing level, i.e. running directly against the aggregate function, rather tha

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread hvanhovell
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133566608 Thanks. I was aiming for compatibility with the existing approxCountDistinct, but we can also implement HLL++. HLL++ introduces three (orthogonal) refinements

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133562453 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133562454 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133562302 [Test build #41377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41377/console) for PR 8362 at commit [`e178d9e`](https://github.

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133544919 This made my day. The approach is super cool. Couple suggestions: 1. Can we use HyperLogLogPlus? It's also in streamlib: https://github.com/addthis/stream-

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133525793 [Test build #41377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41377/consoleFull) for PR 8362 at commit [`e178d9e`](https://gith

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133523850 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133523827 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not h

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread yhuai
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133523761 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabl

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8362#issuecomment-133515988 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your pr

[GitHub] spark pull request: [SPARK-9741][SQL] Approximate Count Distinct u...

2015-08-21 Thread hvanhovell
GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/8362 [SPARK-9741][SQL] Approximate Count Distinct using the new UDAF interface. This PR implements a HyperLogLog based Approximate Count Distinct function using the new UDAF interface. The im