Re: Dataset and lambdas

2015-12-07 Thread Deenar Toraskar
Michael

Having VectorUnionSumUDAF implemented would be great. It is quite
generic: it performs an element-wise sum of arrays and maps
https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/timeseries/VectorUnionSumUDAF.java
and would be of massive benefit for a lot of risk analytics.
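For readers unfamiliar with the semantics, the element-wise "union sum" can be sketched in plain Scala (a minimal illustration of the per-pair combine step only, not the brickhouse implementation; the function names are mine):

```scala
// arraySum: positional sum of two equal-length arrays.
def arraySum(a: Array[Double], b: Array[Double]): Array[Double] = {
  require(a.length == b.length, "arrays must have the same length")
  a.zip(b).map { case (x, y) => x + y }
}

// mapUnionSum: union of the key sets, summing values; a missing key counts as 0.
def mapUnionSum(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
  (a.keySet ++ b.keySet).map { k =>
    k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))
  }.toMap
```

A UDAF would fold this pairwise combine across all rows in a group; the sketch shows only the binary operation being folded.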

In general most of the brickhouse UDFs are quite useful
https://github.com/klout/brickhouse. Happy to help out.

On another note, what would be involved in having arrays backed by a sparse
array (I am assuming the current implementation is dense), i.e. something like
native support for http://spark.apache.org/docs/latest/mllib-data-types.html
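As a sketch of what a sparse-backed array might look like (purely illustrative; this is not an existing Spark type, just the idea behind MLlib's SparseVector expressed in plain Scala):

```scala
// A sparse vector stored as index -> value; absent indices are implicitly 0.0.
case class SparseVec(size: Int, values: Map[Int, Double]) {
  def apply(i: Int): Double = values.getOrElse(i, 0.0)

  // Element-wise sum touches only the indices present in either operand.
  def +(other: SparseVec): SparseVec = {
    require(size == other.size, "dimensions must match")
    val sums = (values.keySet ++ other.values.keySet).map { i =>
      i -> (this(i) + other(i))
    }.toMap
    SparseVec(size, sums.filter(_._2 != 0.0)) // drop zeros to stay sparse
  }
}
```

The win over a dense representation is that both storage and the element-wise ops scale with the number of non-zeros rather than the declared size.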

Regards
Deenar

On 7 December 2015 at 20:21, Michael Armbrust 
wrote:

> On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar wrote:
>>
>> On a similar note, what is involved in getting native support for some
>> user defined functions, so that they are as efficient as native Spark SQL
>> expressions? I had one particular one - an arraySum (element-wise sum) that
>> is heavily used in a lot of risk analytics.
>>
>
> To get the best performance you have to implement a catalyst expression
> with codegen.  This however is necessarily an internal (unstable) interface
> since we are constantly making breaking changes to improve performance.  So
> if it's a common enough operation we should bake it into the engine.
>
> That said, the code generated encoders that we created for datasets should
> lower the cost of calling into external functions as we start using them in
> more and more places (i.e.
> https://issues.apache.org/jira/browse/SPARK-11593)
>


Re: Dataset and lambdas

2015-12-07 Thread Koert Kuipers
great thanks

On Mon, Dec 7, 2015 at 3:02 PM, Michael Armbrust 
wrote:

> These specific JIRAs don't exist yet, but watch SPARK- as we'll make
> sure everything shows up there.
>
> On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers  wrote:
>
>> that's good news about plans to avoid unnecessary conversions, and allow
>> access to more efficient internal types. could you point me to the jiras,
>> if they exist already? i just tried to find them but had little luck.
>> best, koert
>>
>> On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust 
>> wrote:
>>
>>> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers  wrote:
>>>
 hello all,
 DataFrame internally uses a different encoding for values than what the
 user sees. i assume the same is true for Dataset?

>>>
>>> This is true.  We encode objects in the tungsten binary format using
>>> code generated serializers.
>>>
>>>
 if so, does this mean that a function like Dataset.map needs to
 convert all the values twice (once to user format and then back to internal
 format)? or is it perhaps possible to write scala functions that operate on
 internal formats and avoid this?

>>>
>>> Currently this is true, but there are plans to avoid unnecessary
>>> conversions (back to back maps / filters, etc) and only convert when we
>>> need to (shuffles, sorting, hashing, SQL operations).
>>>
>>> There are also plans to allow you to directly access some of the more
>>> efficient internal types by using them as fields in your classes (mutable
>>> UTF8 String instead of the immutable java.lang.String).
>>>
>>>
>>
>


Re: Dataset and lambdas

2015-12-07 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar 
wrote:
>
> On a similar note, what is involved in getting native support for some
> user defined functions, so that they are as efficient as native Spark SQL
> expressions? I had one particular one - an arraySum (element-wise sum) that
> is heavily used in a lot of risk analytics.
>

To get the best performance you have to implement a catalyst expression
with codegen.  This however is necessarily an internal (unstable) interface
since we are constantly making breaking changes to improve performance.  So
if it's a common enough operation we should bake it into the engine.

That said, the code generated encoders that we created for datasets should
lower the cost of calling into external functions as we start using them in
more and more places (i.e. https://issues.apache.org/jira/browse/SPARK-11593)
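As a rough, non-Spark illustration of why a compiled expression beats an interpreted expression tree (this only loosely mimics Catalyst's shape; the real internal API emits and compiles Java source, and, as noted above, is unstable):

```scala
// Interpreted evaluation: one pattern dispatch per tree node per row.
sealed trait Expr { def eval(row: Array[Double]): Double }
case class Col(i: Int) extends Expr {
  def eval(row: Array[Double]): Double = row(i)
}
case class Add(l: Expr, r: Expr) extends Expr {
  def eval(row: Array[Double]): Double = l.eval(row) + r.eval(row)
}

// "Compilation": walk the tree once up front and fuse it into composed
// functions, so per-row evaluation no longer re-traverses the tree.
def compile(e: Expr): Array[Double] => Double = e match {
  case Col(i) => row => row(i)
  case Add(l, r) =>
    val (fl, fr) = (compile(l), compile(r))
    row => fl(row) + fr(row)
}
```

Both paths compute the same result; the compiled form just pays the tree walk once instead of once per row, which is the essence of what codegen buys.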


Re: Dataset and lambdas

2015-12-07 Thread Michael Armbrust
These specific JIRAs don't exist yet, but watch SPARK- as we'll make
sure everything shows up there.

On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers  wrote:

> that's good news about plans to avoid unnecessary conversions, and allow
> access to more efficient internal types. could you point me to the jiras,
> if they exist already? i just tried to find them but had little luck.
> best, koert
>
> On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust 
> wrote:
>
>> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers  wrote:
>>
>>> hello all,
>>> DataFrame internally uses a different encoding for values than what the
>>> user sees. i assume the same is true for Dataset?
>>>
>>
>> This is true.  We encode objects in the tungsten binary format using code
>> generated serializers.
>>
>>
>>> if so, does this mean that a function like Dataset.map needs to convert
>>> all the values twice (once to user format and then back to internal
>>> format)? or is it perhaps possible to write scala functions that operate on
>>> internal formats and avoid this?
>>>
>>
>> Currently this is true, but there are plans to avoid unnecessary
>> conversions (back to back maps / filters, etc) and only convert when we
>> need to (shuffles, sorting, hashing, SQL operations).
>>
>> There are also plans to allow you to directly access some of the more
>> efficient internal types by using them as fields in your classes (mutable
>> UTF8 String instead of the immutable java.lang.String).
>>
>>
>


Re: Dataset and lambdas

2015-12-06 Thread Koert Kuipers
that's good news about plans to avoid unnecessary conversions, and allow
access to more efficient internal types. could you point me to the jiras,
if they exist already? i just tried to find them but had little luck.
best, koert

On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust 
wrote:

> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers  wrote:
>
>> hello all,
>> DataFrame internally uses a different encoding for values than what the
>> user sees. i assume the same is true for Dataset?
>>
>
> This is true.  We encode objects in the tungsten binary format using code
> generated serializers.
>
>
>> if so, does this mean that a function like Dataset.map needs to convert
>> all the values twice (once to user format and then back to internal
>> format)? or is it perhaps possible to write scala functions that operate on
>> internal formats and avoid this?
>>
>
> Currently this is true, but there are plans to avoid unnecessary
> conversions (back to back maps / filters, etc) and only convert when we
> need to (shuffles, sorting, hashing, SQL operations).
>
> There are also plans to allow you to directly access some of the more
> efficient internal types by using them as fields in your classes (mutable
> UTF8 String instead of the immutable java.lang.String).
>
>


Re: Dataset and lambdas

2015-12-05 Thread Deenar Toraskar
Hi Michael

On a similar note, what is involved in getting native support for some
user-defined functions, so that they are as efficient as native Spark SQL
expressions? I had one particular one - an arraySum (element-wise sum) that
is heavily used in a lot of risk analytics.


Deenar

On 5 December 2015 at 21:09, Michael Armbrust 
wrote:

> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers  wrote:
>
>> hello all,
>> DataFrame internally uses a different encoding for values than what the
>> user sees. i assume the same is true for Dataset?
>>
>
> This is true.  We encode objects in the tungsten binary format using code
> generated serializers.
>
>
>> if so, does this mean that a function like Dataset.map needs to convert
>> all the values twice (once to user format and then back to internal
>> format)? or is it perhaps possible to write scala functions that operate on
>> internal formats and avoid this?
>>
>
> Currently this is true, but there are plans to avoid unnecessary
> conversions (back to back maps / filters, etc) and only convert when we
> need to (shuffles, sorting, hashing, SQL operations).
>
> There are also plans to allow you to directly access some of the more
> efficient internal types by using them as fields in your classes (mutable
> UTF8 String instead of the immutable java.lang.String).
>
>


Re: Dataset and lambdas

2015-12-05 Thread Michael Armbrust
On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers  wrote:

> hello all,
> DataFrame internally uses a different encoding for values than what the
> user sees. i assume the same is true for Dataset?
>

This is true.  We encode objects in the Tungsten binary format using
code-generated serializers.


> if so, does this mean that a function like Dataset.map needs to convert
> all the values twice (once to user format and then back to internal
> format)? or is it perhaps possible to write scala functions that operate on
> internal formats and avoid this?
>

Currently this is true, but there are plans to avoid unnecessary
conversions (back to back maps / filters, etc) and only convert when we
need to (shuffles, sorting, hashing, SQL operations).
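The conversion cost, and what fusing back-to-back maps would save, can be sketched abstractly; `encode`/`decode` here are stand-ins for the internal-format/JVM-object conversions, not real Spark APIs:

```scala
// Stand-ins for the internal binary <-> user-object conversions.
def decode(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
def encode(s: String): Array[Byte] = s.getBytes("UTF-8")

// Today: each map() pays a full decode + encode round trip.
def twoMaps(data: Seq[Array[Byte]],
            f: String => String, g: String => String): Seq[Array[Byte]] =
  data.map(b => encode(f(decode(b)))).map(b => encode(g(decode(b))))

// Planned fusion: adjacent maps composed, one round trip in total.
def twoMapsFused(data: Seq[Array[Byte]],
                 f: String => String, g: String => String): Seq[Array[Byte]] =
  data.map(b => encode(g(f(decode(b)))))
```

The results are identical; fusion just halves the number of conversions for this two-map chain, converting only where the engine genuinely needs the internal format.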

There are also plans to allow you to directly access some of the more
efficient internal types by using them as fields in your classes (mutable
UTF8 String instead of the immutable java.lang.String).


Dataset and lambdas

2015-12-05 Thread Koert Kuipers
hello all,
DataFrame internally uses a different encoding for values than what the
user sees. i assume the same is true for Dataset?

if so, does this mean that a function like Dataset.map needs to convert
all the values twice (once to user format and then back to internal
format)? or is it perhaps possible to write scala functions that operate on
internal formats and avoid this?

i am excited to see lambdas back in full force in DataFrame/Dataset world.
the few functions on DataFrame that already accepted lambdas are not very
user friendly. but i worry that what i want (blackbox/general scala
functions/lambdas, so i am not restricted to a few etl-like operators) is at
odds with the design of Dataset/DataFrame.

best, koert