Re: HyperLogLogUDT

2015-09-13 Thread Yin Huai
The user implementing a UDAF does not need to consider what the underlying buffer is. Our aggregate operator will figure out if the buffer data types of all aggregate functions used by a query are supported by the UnsafeRow. If so, we will use the UnsafeRow as the buffer. Regarding the performance

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Thanks Yin. So how does one ensure a UDAF works with Tungsten and UnsafeRow buffers? Or is this something that will be included in the UDAF interface in future? Is there a performance difference between extending UDAF vs Aggregate2? It's also not clear to me how to handle inputs of dif

Re: HyperLogLogUDT

2015-09-12 Thread Yin Huai
Hi Nick, The buffer exposed to the UDAF interface is just a view of the underlying buffer (this underlying buffer is shared by different aggregate functions and every function takes one or multiple slots). If you need a UDAF, extending UserDefinedAggregationFunction is the preferred approach. AggregateFun
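The "view over shared slots" idea described above can be illustrated with a minimal Python sketch. This is not Spark's implementation: the `BufferView` class, the slot layout, and the two toy aggregates (a count and an average) are all hypothetical names invented for illustration. The point is only that each function reads and writes through its own window while a single buffer holds the state for all of them.

```python
class BufferView:
    """Hypothetical window onto a contiguous range of slots in a shared buffer."""

    def __init__(self, shared, offset, length):
        self._shared = shared    # the single underlying buffer
        self._offset = offset    # first slot owned by this aggregate function
        self._length = length    # number of slots owned

    def _check(self, i):
        if not 0 <= i < self._length:
            raise IndexError("slot index outside this view")

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        self._check(i)
        return self._shared[self._offset + i]

    def __setitem__(self, i, value):
        self._check(i)
        self._shared[self._offset + i] = value


# One shared buffer: slot 0 belongs to a count, slots 1-2 to an average
# (running sum and running count). The layout is an assumption.
shared = [0, 0.0, 0]
count_view = BufferView(shared, offset=0, length=1)
avg_view = BufferView(shared, offset=1, length=2)

for x in [3.0, 5.0, 10.0]:
    count_view[0] += 1     # the count function only sees its own slot
    avg_view[0] += x       # the average function sees slots 1-2 as 0-1
    avg_view[1] += 1

average = avg_view[0] / avg_view[1]
```

Because each function only addresses slots relative to its own view, the operator is free to change what the shared buffer actually is without the function noticing.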

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Ok, that makes sense. So this is (a) more efficient, since as far as I can see it is updating the HLL registers directly in the buffer for each value, and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is it currently possible to specify an UnsafeRow as a buffer in a UDAF? So

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
I am typically all for code re-use. The reason for writing this is to prevent the indirection of a UDT and work directly against memory. A UDT will work fine at the moment because we still use GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you would use an UnsafeRow as an A

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
I should add that surely the idea behind UDT is exactly that it can (a) fit automatically into DFs and Tungsten and (b) that it can be used efficiently in writing one's own UDTs and UDAFs? On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath wrote: > Can I ask why you've done this as a custom implem

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Can I ask why you've done this as a custom implementation rather than using StreamLib, which is already implemented and widely used? It seems more portable to me to use a library - for example, I'd like to export the grouped data with raw HLLs to say Elasticsearch, and then do further on-demand agg

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
Hello Nick, I have been working on a (UDT-less) implementation of HLL++. You can find the PR here: https://github.com/apache/spark/pull/8362. This currently implements the dense version of HLL++, which is a further development of HLL. It returns a Long, but it shouldn't be too hard to return a Row co

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Inspired by this post: http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/, I've started putting together something based on the Spark 1.5 UDAF interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141 Some questions - 1. How do I get the UDAF

Re: HyperLogLogUDT

2015-07-01 Thread Reynold Xin
Yes - it's very interesting. However, ideally we should have a version of hyperloglog that can work directly against some raw bytes in memory (rather than java objects), in order for this to fit the Tungsten execution model where everything is operating directly against some memory address. On Wed
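To make "working directly against raw bytes" concrete, here is a minimal Python sketch of a dense HyperLogLog whose entire state is a flat `bytearray`, one byte per register. It is plain HLL with a linear-counting correction for small cardinalities, not HLL++ and not the code in any Spark PR; the precision `P = 12`, the SHA-1-based hash, and all function names are assumptions chosen for illustration.

```python
import hashlib
import math

P = 12        # precision (assumed): 2**P registers
M = 1 << P    # number of registers


def new_sketch():
    """The whole sketch is M raw bytes, with no per-register objects."""
    return bytearray(M)


def _hash64(value):
    # 64-bit hash taken from SHA-1; an illustrative choice only.
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def update(sketch, value):
    """Fold one value into the registers, in place."""
    h = _hash64(value)
    idx = h >> (64 - P)                         # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)            # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1     # leading zeros in rest, plus one
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(dst, src):
    """Merge src into dst by taking the register-wise maximum."""
    for i in range(M):
        if src[i] > dst[i]:
            dst[i] = src[i]


def estimate(sketch):
    """Raw HLL estimate, falling back to linear counting when it is small."""
    alpha = 0.7213 / (1.0 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros > 0:
        return M * math.log(M / zeros)
    return raw
```

Since the state is only bytes, two sketches built on different partitions can be merged with a register-wise maximum, and the same bytes could in principle be exported and merged elsewhere, provided both sides agree on the hash and the precision.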

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Sure, I can copy the code, but my aim was more to understand: (a) if this is broadly interesting enough to folks to think about updating / extending the existing UDAF within Spark, and (b) how to register one's own custom UDAF - in which case it could be a Spark package, for example. All examples

Re: HyperLogLogUDT

2015-07-01 Thread Daniel Darabos
It's already possible to just copy the code from countApproxDistinct and access the HLL directly, or do anything you like. On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath wrote: > Any thoughts?

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Any thoughts? On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath wrote: > Hey Spark devs > I've been looking at DF UDFs and UDAFs. The approx distinct is using > hyperloglog, > but there is only an option to return the count as a Long. > It can be useful to be able to return