Re: I need help mapping a PairRDD solution to Dataset
Yeah, that's tough. Perhaps you could do something like a flatMap and emit multiple virtual copies of each student, one for each region neighboring their actual region.

On Wed, Jan 20, 2016 at 10:50 AM, Steve Lewis wrote:
> Thanks - this helps a lot except for the issue of looking at schools in
> neighboring regions
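A minimal sketch of that flatMap idea, assuming a hypothetical Student case class and a neighbors() lookup that returns adjacent region ids; none of these names come from the thread itself.

```scala
case class Student(id: Long, region: Int)

// Hypothetical: returns the ids of regions adjacent to `region`,
// e.g. from a precomputed adjacency map broadcast to the executors.
def neighbors(region: Int): Seq[Int] = ???

// Emit one "virtual" copy of each student keyed by every region they might
// match in: their own region plus each neighboring region. Joining these
// copies against schools keyed by region then covers the boundary cases.
val studentsByCandidateRegion = students.flatMap { s =>
  (s.region +: neighbors(s.region)).map(r => (r, s))
}
```

After the join you would compute the real distance for each candidate pair and keep the minimum per student, so the extra copies cost only some duplicated comparisons near region boundaries.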
Re: I need help mapping a PairRDD solution to Dataset
Thanks - this helps a lot, except for the issue of looking at schools in neighboring regions.

On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust wrote:

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
Re: I need help mapping a PairRDD solution to Dataset
The analog to PairRDD is a GroupedDataset (created by calling groupBy), which offers similar functionality but doesn't require you to construct new objects that are in the form of key/value pairs. It doesn't matter if they are complex objects, as long as you can create an encoder for them (currently supported for JavaBeans and case classes, but support for custom encoders is on the roadmap). These encoders are responsible for both fast serialization and providing a view of your object that looks like a row.

Based on the description of your problem, it sounds like you can use joinWith and just express the predicate as a column:

  import org.apache.spark.sql.functions._
  ds1.as("child").joinWith(ds2.as("school"), expr("child.region = school.region"))

The as operation is only required if you need to differentiate columns on either side that have the same name.

Note that by defining the join condition as an expression instead of a lambda function, we are giving Spark SQL more information about the join, so it can often do the comparison without needing to deserialize the objects, which over time will let us put more optimizations into the engine.

You can also do this using lambda functions if you want, though:

  ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region)) { (key, iter1, iter2) =>
    ...
  }

On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis wrote:
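Filling in the elided cogroup body, here is a hedged sketch of how the per-region nearest-school computation might look. Child, School, and distance() are illustrative stand-ins, not names from the thread, and the real distance measure would replace the placeholder.

```scala
case class Child(id: Long, region: Int, x: Double, y: Double)
case class School(id: Long, region: Int, x: Double, y: Double)

// Placeholder for the expensive distance measure described in the question.
def distance(c: Child, s: School): Double =
  math.hypot(c.x - s.x, c.y - s.y)

// For each region key, pair every child with the closest school seen under
// that key. Note this only compares within one region; handling neighboring
// regions still requires replicating schools under adjacent keys first.
val nearest = children.groupBy(_.region).cogroup(schools.groupBy(_.region)) {
  (region, kids, schs) =>
    // Iterators are single-pass, so materialize the schools once per region.
    val schoolList = schs.toList
    if (schoolList.isEmpty) Iterator.empty
    else kids.map(c => (c.id, schoolList.minBy(s => distance(c, s)).id))
}
```

The empty-list guard matters because a region may contain children but no schools, which is exactly the case the neighboring-region replication is meant to cover.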
I need help mapping a PairRDD solution to Dataset
We have been working on a large search problem which we have been solving in the following way.

We have two sets of objects, say children and schools. The objective is to find the closest school to each child. There is a distance measure, but it is relatively expensive and would be very costly to apply to all pairs.

However, the map can be divided into regions. If we assume that the closest school to a child is in his region or a neighboring region, we need only compute the distance between a child and all schools in his region and neighboring regions.

We currently use paired RDDs and a join to do this, assigning children to one region and schools to their own region and neighboring regions, and then creating a join and computing distances. Note the real problem is more complex.

I can create Datasets of the two types of objects but see no Dataset analog for a PairRDD. How could I map my solution using PairRDDs to Datasets - assume the two objects are relatively complex data types and do not look like SQL dataset rows?
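For reference, the existing PairRDD approach described above could look roughly like this, with hypothetical Child/School types and a neighbors() and distance() supplied elsewhere (these names are illustrative, not from the original code).

```scala
import org.apache.spark.rdd.RDD

// Children are keyed by their own region only.
val childrenByRegion: RDD[(Int, Child)] =
  children.map(c => (c.region, c))

// Schools are keyed under their own region AND every neighboring region,
// so the join also sees schools just across a region boundary.
val schoolsByRegion: RDD[(Int, School)] =
  schools.flatMap(s => (s.region +: neighbors(s.region)).map(r => (r, s)))

// Join on region, compute distances for the candidate pairs, and keep the
// closest school per child.
val closest: RDD[(Long, (Long, Double))] =
  childrenByRegion.join(schoolsByRegion)
    .map { case (_, (c, s)) => (c.id, (s.id, distance(c, s))) }
    .reduceByKey((a, b) => if (a._2 <= b._2) a else b)
```

The question, then, is how to express this key-by-region, join, and reduce-by-key pipeline against Datasets of complex objects.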