Re: Dataset and lambdas
Michael,

Having VectorUnionSumUDAF implemented would be great. It is quite generic - it does an element-wise sum of arrays and maps (https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/timeseries/VectorUnionSumUDAF.java) - and would be a massive benefit for a lot of risk analytics. In general, most of the Brickhouse UDFs are quite useful: https://github.com/klout/brickhouse. Happy to help out.

On another note, what would be involved in having arrays backed by a sparse array (I am assuming the current implementation is dense), i.e. something like native support for http://spark.apache.org/docs/latest/mllib-data-types.html?

Regards,
Deenar

On 7 December 2015 at 20:21, Michael Armbrust wrote:

> On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar wrote:
>>
>> On a similar note, what is involved in getting native support for some
>> user-defined functions, so that they are as efficient as native Spark SQL
>> expressions? I had one particular one - an arraySum (element-wise sum) -
>> that is heavily used in a lot of risk analytics.
>
> To get the best performance you have to implement a catalyst expression
> with codegen. This, however, is necessarily an internal (unstable)
> interface, since we are constantly making breaking changes to improve
> performance. So if it's a common enough operation we should bake it into
> the engine.
>
> That said, the code-generated encoders that we created for Datasets should
> lower the cost of calling into external functions as we start using them
> in more and more places (i.e.
> https://issues.apache.org/jira/browse/SPARK-11593).
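For a concrete reference point, here is a minimal sketch of the element-wise sum written as an ordinary Scala UDF against the Spark 1.6-era entry points (ArraySumExample and arraySum are illustrative names, not an existing API). It works today, but every call pays the internal-to-external conversion cost discussed in this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object ArraySumExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("arraySum"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Element-wise sum of two double arrays as a plain Scala UDF.
    // Spark converts each internal array to a Seq[Double] before the call
    // and converts the result back to the internal format afterwards.
    val arraySum = udf { (a: Seq[Double], b: Seq[Double]) =>
      a.zip(b).map { case (x, y) => x + y }
    }

    val df = Seq((Seq(1.0, 2.0), Seq(3.0, 4.0))).toDF("a", "b")
    df.select(arraySum($"a", $"b").as("sum")).show()  // [4.0, 6.0]

    sc.stop()
  }
}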
Re: Dataset and lambdas
great, thanks!

On Mon, Dec 7, 2015 at 3:02 PM, Michael Armbrust wrote:

> These specific JIRAs don't exist yet, but watch SPARK- as we'll make
> sure everything shows up there.
>
> On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers wrote:
>
>> that's good news about the plans to avoid unnecessary conversions and to
>> allow access to the more efficient internal types. could you point me to
>> the JIRAs, if they exist already? i just tried to find them but had
>> little luck.
>> best, koert
>>
>> On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust wrote:
>>
>>> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers wrote:
>>>
>>>> hello all,
>>>> DataFrame internally uses a different encoding for values than what
>>>> the user sees. i assume the same is true for Dataset?
>>>
>>> This is true. We encode objects in the Tungsten binary format using
>>> code-generated serializers.
>>>
>>>> if so, does this mean that a function like Dataset.map needs to
>>>> convert all the values twice (once to the user format and then back to
>>>> the internal format)? or is it perhaps possible to write scala
>>>> functions that operate on the internal formats and avoid this?
>>>
>>> Currently this is true, but there are plans to avoid unnecessary
>>> conversions (back-to-back maps / filters, etc.) and only convert when
>>> we need to (shuffles, sorting, hashing, SQL operations).
>>>
>>> There are also plans to allow you to directly access some of the more
>>> efficient internal types by using them as fields in your classes (a
>>> mutable UTF8String instead of the immutable java.lang.String).
Re: Dataset and lambdas
On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar wrote:

> On a similar note, what is involved in getting native support for some
> user-defined functions, so that they are as efficient as native Spark SQL
> expressions? I had one particular one - an arraySum (element-wise sum) -
> that is heavily used in a lot of risk analytics.

To get the best performance you have to implement a catalyst expression with codegen. This, however, is necessarily an internal (unstable) interface, since we are constantly making breaking changes to improve performance. So if it's a common enough operation we should bake it into the engine.

That said, the code-generated encoders that we created for Datasets should lower the cost of calling into external functions as we start using them in more and more places (i.e. https://issues.apache.org/jira/browse/SPARK-11593).
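To make the trade-off concrete, below is a rough sketch of what such a native expression could look like (ElementWiseSum is a hypothetical name; it targets Spark's internal, unstable expression API as laid out in recent 2.x/3.x releases, so the exact traits and methods have moved around between versions - exactly the caveat above). CodegenFallback is used here to sidestep hand-written codegen:

import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types.{ArrayType, DataType, DoubleType}

// Hypothetical native element-wise sum over two array<double> columns.
// Because it is built on the internal expression API, it can read
// ArrayData directly without ever materializing user-facing Seq[Double]
// objects, which is where the speedup over a Scala UDF comes from.
case class ElementWiseSum(left: Expression, right: Expression)
  extends BinaryExpression with CodegenFallback {

  override def dataType: DataType = ArrayType(DoubleType)

  // BinaryExpression only calls this when both inputs are non-null.
  override protected def nullSafeEval(a: Any, b: Any): Any = {
    val x = a.asInstanceOf[ArrayData]
    val y = b.asInstanceOf[ArrayData]
    val n = math.min(x.numElements(), y.numElements())
    val out = new Array[Any](n)
    var i = 0
    while (i < n) {
      out(i) = x.getDouble(i) + y.getDouble(i)
      i += 1
    }
    new GenericArrayData(out)
  }

  // Note: being internal, this interface shifts between releases; e.g.
  // Spark 3.2+ additionally requires overriding withNewChildrenInternal.
}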
Re: Dataset and lambdas
These specific JIRAs don't exist yet, but watch SPARK- as we'll make sure everything shows up there.

On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers wrote:

> that's good news about the plans to avoid unnecessary conversions and to
> allow access to the more efficient internal types. could you point me to
> the JIRAs, if they exist already? i just tried to find them but had
> little luck.
> best, koert
>
> On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust wrote:
>
>> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers wrote:
>>
>>> hello all,
>>> DataFrame internally uses a different encoding for values than what the
>>> user sees. i assume the same is true for Dataset?
>>
>> This is true. We encode objects in the Tungsten binary format using
>> code-generated serializers.
>>
>>> if so, does this mean that a function like Dataset.map needs to convert
>>> all the values twice (once to the user format and then back to the
>>> internal format)? or is it perhaps possible to write scala functions
>>> that operate on the internal formats and avoid this?
>>
>> Currently this is true, but there are plans to avoid unnecessary
>> conversions (back-to-back maps / filters, etc.) and only convert when we
>> need to (shuffles, sorting, hashing, SQL operations).
>>
>> There are also plans to allow you to directly access some of the more
>> efficient internal types by using them as fields in your classes (a
>> mutable UTF8String instead of the immutable java.lang.String).
Re: Dataset and lambdas
that's good news about the plans to avoid unnecessary conversions and to allow access to the more efficient internal types. could you point me to the JIRAs, if they exist already? i just tried to find them but had little luck.
best, koert

On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust wrote:

> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers wrote:
>
>> hello all,
>> DataFrame internally uses a different encoding for values than what the
>> user sees. i assume the same is true for Dataset?
>
> This is true. We encode objects in the Tungsten binary format using
> code-generated serializers.
>
>> if so, does this mean that a function like Dataset.map needs to convert
>> all the values twice (once to the user format and then back to the
>> internal format)? or is it perhaps possible to write scala functions
>> that operate on the internal formats and avoid this?
>
> Currently this is true, but there are plans to avoid unnecessary
> conversions (back-to-back maps / filters, etc.) and only convert when we
> need to (shuffles, sorting, hashing, SQL operations).
>
> There are also plans to allow you to directly access some of the more
> efficient internal types by using them as fields in your classes (a
> mutable UTF8String instead of the immutable java.lang.String).
Re: Dataset and lambdas
Hi Michael,

On a similar note, what is involved in getting native support for some user-defined functions, so that they are as efficient as native Spark SQL expressions? I had one particular one - an arraySum (element-wise sum) - that is heavily used in a lot of risk analytics.

Deenar

On 5 December 2015 at 21:09, Michael Armbrust wrote:

> On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers wrote:
>
>> hello all,
>> DataFrame internally uses a different encoding for values than what the
>> user sees. i assume the same is true for Dataset?
>
> This is true. We encode objects in the Tungsten binary format using
> code-generated serializers.
>
>> if so, does this mean that a function like Dataset.map needs to convert
>> all the values twice (once to the user format and then back to the
>> internal format)? or is it perhaps possible to write scala functions
>> that operate on the internal formats and avoid this?
>
> Currently this is true, but there are plans to avoid unnecessary
> conversions (back-to-back maps / filters, etc.) and only convert when we
> need to (shuffles, sorting, hashing, SQL operations).
>
> There are also plans to allow you to directly access some of the more
> efficient internal types by using them as fields in your classes (a
> mutable UTF8String instead of the immutable java.lang.String).
Re: Dataset and lambdas
On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers wrote:

> hello all,
> DataFrame internally uses a different encoding for values than what the
> user sees. i assume the same is true for Dataset?

This is true. We encode objects in the Tungsten binary format using code-generated serializers.

> if so, does this mean that a function like Dataset.map needs to convert
> all the values twice (once to the user format and then back to the
> internal format)? or is it perhaps possible to write scala functions that
> operate on the internal formats and avoid this?

Currently this is true, but there are plans to avoid unnecessary conversions (back-to-back maps / filters, etc.) and only convert when we need to (shuffles, sorting, hashing, SQL operations).

There are also plans to allow you to directly access some of the more efficient internal types by using them as fields in your classes (a mutable UTF8String instead of the immutable java.lang.String).
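As a concrete illustration of the round trip under discussion, here is a minimal sketch against the Spark 1.6 Dataset API (Trade and MapConversionSketch are made-up names); each typed map decodes internal rows into user objects and re-encodes the results afterwards:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative case class; any product type with an encoder works.
case class Trade(id: Long, notional: Double)

object MapConversionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val ds = Seq(Trade(1L, 100.0), Trade(2L, 250.0)).toDS()

    // Each lambda sees real Trade objects: the encoder decodes the internal
    // (Tungsten) row into a Trade before the call and re-encodes the result
    // after it, so these back-to-back maps currently pay two round trips.
    val notionals = ds.map(t => t.copy(notional = t.notional * 2))
                      .map(t => t.notional)

    notionals.collect().foreach(println)  // 200.0, 500.0

    sc.stop()
  }
}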
Dataset and lambdas
hello all,

DataFrame internally uses a different encoding for values than what the user sees. i assume the same is true for Dataset?

if so, does this mean that a function like Dataset.map needs to convert all the values twice (once to the user format and then back to the internal format)? or is it perhaps possible to write scala functions that operate on the internal formats and avoid this?

i am excited to see lambdas back in full force in the DataFrame/Dataset world. the few functions on DataFrame that already accepted lambdas are not very user friendly. but i worried that what i want (blackbox/general scala functions/lambdas, so i am not restricted to a few ETL-like operators) is at odds with the design of Dataset/DataFrame.

best, koert