Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Michael Armbrust
The analog to PairRDD is a GroupedDataset (created by calling groupBy),
which offers similar functionality, but doesn't require you to construct
new objects in the form of key/value pairs.  It doesn't matter if
they are complex objects, as long as you can create an encoder for them
(currently supported for JavaBeans and case classes, but support for custom
encoders is on the roadmap).  These encoders are responsible for both fast
serialization and providing a view of your object that looks like a row.
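
For concreteness, here is a minimal sketch of the encoder side, assuming
hypothetical Child and School case classes standing in for your real, more
complex types (this uses the Spark 1.6-era SQLContext API):

```scala
// Sketch only: assumes a SQLContext named sqlContext is already in scope
// (e.g. from the spark-shell), and that Child/School are simplified
// stand-ins for your real types.
case class Child(name: String, region: String, x: Double, y: Double)
case class School(name: String, region: String, x: Double, y: Double)

import sqlContext.implicits._  // provides implicit encoders for case classes

val ds1 = sqlContext.createDataset(Seq(Child("a", "r1", 0.0, 0.0)))
val ds2 = sqlContext.createDataset(Seq(School("s1", "r1", 1.0, 1.0)))
```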

Based on the description of your problem, it sounds like you can use
joinWith and just express the predicate as a column.

import org.apache.spark.sql.functions._
ds1.as("child").joinWith(ds2.as("school"), expr("child.region = school.region"))

The as operation is only required if you need to differentiate columns on
either side that have the same name.
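
Since joinWith returns a Dataset of pairs, the expensive distance computation
can then run only over the matched candidates; a sketch, assuming a
hypothetical distance function:

```scala
// joined has type Dataset[(Child, School)] when produced by joinWith above.
// distance(...) is a placeholder for your expensive measure.
val withDistance = joined.map { case (child, school) =>
  (child, school, distance(child, school))
}
```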

Note that by defining the join condition as an expression instead of a
lambda function, we are giving Spark SQL more information about the join so
it can often do the comparison without needing to deserialize the object,
which over time will let us put more optimizations into the engine.

You can also do this using lambda functions if you want though:

ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region)) { (key, iter1, iter2) =>
  ...
}
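
Filled in, that cogroup might look something like the following sketch; it
assumes hypothetical Child/School case classes with a region field, a
distance function you supply, and that every region contains at least one
school:

```scala
ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region)) {
  (region, children, schools) =>
    // Iterators can only be traversed once, so materialize the schools.
    val schoolList = schools.toList
    children.map { child =>
      // Assumes schoolList is non-empty; guard with minByOption-style
      // logic otherwise.
      (child, schoolList.minBy(s => distance(child, s)))
    }
}
```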


On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis  wrote:

> We have been working on a large search problem, which we have been solving
> in the following way.
>
> We have two sets of objects, say children and schools. The objective is to
> find the closest school to each child. There is a distance measure, but it
> is relatively expensive and would be very costly to apply to all pairs.
>
> However, the map can be divided into regions. If we assume that the closest
> school to a child is in his region or a neighboring region, we need only
> compute the distance between a child and all schools in his region and
> neighboring regions.
>
> We currently use paired RDDs and a join to do this: we assign children to
> one region, assign schools to their own region and its neighboring regions,
> then join and compute distances. Note that the real problem is more
> complex.
>
> I can create Datasets of the two types of objects but see no Dataset
> analog for a PairRDD. How could I map my solution using PairRDDs to
> Datasets - assume the two objects are relatively complex data types and do
> not look like SQL dataset rows?
>
>
>


Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Steve Lewis
Thanks - this helps a lot except for the issue of looking at schools in
neighboring regions

On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust 
wrote:

> The analog to PairRDD is a GroupedDataset (created by calling groupBy),
> which offers similar functionality, but doesn't require you to construct
> new object that are in the form of key/value pairs.  It doesn't matter if
> they are complex objects, as long as you can create an encoder for them
> (currently supported for JavaBeans and case classes, but support for custom
> encoders is on the roadmap).  These encoders are responsible for both fast
> serialization and providing a view of your object that looks like a row.
>
> Based on the description of your problem, it sounds like you can use
> joinWith and just express the predicate as a column.
>
> import org.apache.spark.sql.functions._
> ds1.as("child").joinWith(ds2.as("school"), expr("child.region =
> school.region"))
>
> The as operation is only required if you need to differentiate columns on
> either side that have the same name.
>
> Note that by defining the join condition as an expression instead of a
> lambda function, we are giving Spark SQL more information about the join so
> it can often do the comparison without needing to deserialize the object,
> which overtime will let us put more optimizations into the engine.
>
> You can also do this using lambda functions if you want though:
>
> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2)
> =>
>   ...
> }
>
>
> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis 
> wrote:
>
>> We have been working a large search problem which we have been solving in
>> the following ways.
>>
>> We have two sets of objects, say children and schools. The object is to
>> find the closest school to each child. There is a distance measure but it
>> is relatively expensive and would be very costly to apply to all pairs.
>>
>> However the map can be divided into regions. If we assume that the
>> closest school to a child is in his region of a neighboring region we need
>> only compute the distance between a child and all schools in his region and
>> neighboring regions.
>>
>> We currently use paired RDDs and a join to do this assigning children to
>> one region and schools to their own region and neighboring regions and then
>> creating a join and computing distances. Note the real problem is more
>> complex.
>>
>> I can create Datasets of the two types of objects but see no Dataset
>> analog for a PairRDD. How could I map my solution using PairRDDs to
>> Datasets - assume the two objects are relatively complex data types and do
>> not look like SQL dataset rows?
>>
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: I need help mapping a PairRDD solution to Dataset

2016-01-20 Thread Michael Armbrust
Yeah, that's tough.  Perhaps you could do something like a flatMap and emit
multiple virtual copies of each student for each region that neighbors
their actual region.
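
Something like this sketch, assuming a hypothetical neighbors(region)
function that returns a region's neighboring region ids:

```scala
// Each child is emitted once for its own region and once per neighboring
// region; the join key is the tagged searchRegion rather than the child's
// actual region. neighbors(...) is a placeholder you would supply.
case class TaggedChild(searchRegion: String, child: Child)

val expanded = children.flatMap { c =>
  (c.region :: neighbors(c.region)).map(r => TaggedChild(r, c))
}
// Joining expanded with schools on searchRegion = school.region then yields
// all candidate pairs, and the closest school can be picked per child.
```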

On Wed, Jan 20, 2016 at 10:50 AM, Steve Lewis  wrote:

> Thanks - this helps a lot except for the issue of looking at schools in
> neighboring regions
>