Questions:

1) I don't see APIs for LEFT or FULL OUTER joins. True?

2) Apache Pig provides different join types such as 'replicated' and 'skewed'. Now 'replicated' may not be a concern in Spark because everything (possibly) happens in memory.

3) Does 'join' (which seems to work like an INNER join) guarantee order? For example, can I assume that columns from the left side will appear before columns from the right, and that their order will be preserved?
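For what it's worth, Spark's pair RDDs do expose a leftOuterJoin, which returns (K, (V, Option[W])) pairs, so the LEFT case is covered even if FULL OUTER is not. Its result shape can be sketched on plain Scala collections; the helper below is hypothetical and only mirrors the semantics, not Spark's partitioned execution:

```scala
// Hypothetical helper mirroring the (K, (V, Option[W])) result shape of
// Spark's RDD leftOuterJoin, written against plain Scala collections.
def leftOuterJoin[K, V, W](left: Seq[(K, V)],
                           right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
  val rightByKey = right.groupBy(_._1)  // index the right side by key
  left.flatMap { case (k, v) =>
    rightByKey.get(k) match {
      case Some(ms) => ms.map { case (_, w) => (k, (v, Some(w))) } // matched rows
      case None     => Seq((k, (v, None)))  // unmatched: keep the left row
    }
  }
}
```

Note that this local version preserves the order of the left side; in Spark itself a join involves a shuffle, so no particular output order should be relied on.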
On a side note, it appears that, as of now, Spark cannot be used as a replacement for Pig without some major coding. Agree?

On Mon, Nov 18, 2013 at 10:47 PM, Horia <[email protected]> wrote:

> It seems to me that what you want is the following procedure:
> - parse each file line by line
> - generate key, value pairs
> - join by key
>
> I think the following should accomplish what you're looking for:
>
> val students = sc.textFile("./students.txt") // mapping over this RDD already maps over lines
> val courses = sc.textFile("./courses.txt")   // mapping over this RDD already maps over lines
> val left = students.map( x => {
>   val columns = x.split(",")
>   (columns(1), (columns(0), columns(2)))
> } )
> val right = courses.map( x => {
>   val columns = x.split(",")
>   (columns(0), columns(1))
> } )
> val joined = left.join(right)
>
> The major difference is selectively returning the fields which you actually want to join, rather than all the fields. A secondary difference is syntactic: you don't need a .map().map() since you can use a slightly more complex function block as illustrated. I think Spark is smart enough to optimize the .map().map() to basically what I've explicitly written...
>
> Horia
>
> On Mon, Nov 18, 2013 at 10:34 PM, Something Something <[email protected]> wrote:
>
>> Was my question so dumb? Or, is this not a good use case for Spark?
>>
>> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <[email protected]> wrote:
>>
>>> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig for quite some time.
>>>
>>> We have quite a few ETL processes running in production that use Pig, but now we're evaluating Spark to see if they would indeed run faster.
>>>
>>> A very common use case in our Pig scripts is joining a file containing facts to a file containing dimension data. The joins are, of course, inner, left & outer.
>>>
>>> I thought I would start simple.
>>> Let's say I have 2 files:
>>>
>>> 1) Students: student_id, course_id, score
>>> 2) Course: course_id, course_title
>>>
>>> We want to produce a file that contains: student_id, course_title, score
>>>
>>> (Note: This is a hypothetical case. The real files have millions of facts & thousands of dimensions.)
>>>
>>> Would something like this work? Note: I did say I am a newbie ;)
>>>
>>> val students = sc.textFile("./students.txt")
>>> val courses = sc.textFile("./courses.txt")
>>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>>> val joined = left.join(right)
>>>
>>> Any pointers in this regard would be greatly appreciated. Thanks.
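Putting the thread's pieces together: since the dimension file is small relative to the facts, the analog of Pig's 'replicated' join is to hold the course table in memory and join each fact by lookup. A minimal sketch on plain Scala collections, with made-up sample rows; in Spark the same maps would run over the textFile RDDs, with the small side collected (or broadcast) rather than built with toMap:

```scala
// Fact lines: student_id,course_id,score; dimension lines: course_id,course_title.
// Sample data is hypothetical.
val studentLines = Seq("s1,c1,90", "s2,c2,85", "s3,c9,70")
val courseLines  = Seq("c1,Algebra", "c2,Biology")

// Key the facts by course_id, as in the thread's left.map(...).
val facts  = studentLines.map(_.split(',')).map(a => (a(1), (a(0), a(2))))
// The small dimension side becomes an in-memory lookup table.
val titles = courseLines.map(_.split(',')).map(a => (a(0), a(1))).toMap

// Map-side ("replicated") join: look up each fact's course; facts with no
// matching course (c9 here) drop out, as in an inner join.
val result = facts.flatMap { case (courseId, (studentId, score)) =>
  titles.get(courseId).map(title => s"$studentId,$title,$score")
}
// result: Seq("s1,Algebra,90", "s2,Biology,85")
```

Because the lookup happens inside the map, no shuffle of the large fact side is needed, which is exactly the property Pig's 'replicated' join is after.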
