On Nov 20, 2013 8:34 AM, "Something Something" <[email protected]> wrote:
>
> Questions:
>
> 1) I don't see APIs for LEFT, FULL OUTER Joins. True?
>
> 2) Apache Pig provides different join types such as 'replicated' and 'skewed'. Now 'replicated' may not be a concern in Spark 'cause everything (possibly) happens in memory.
>
> 3) Does the 'join' (which seems to work like an INNER Join) guarantee order? For example, can I assume that columns from the left side will appear before columns from the right side, and that their order will be preserved?
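(On question 1: Spark's PairRDDFunctions does expose leftOuterJoin and rightOuterJoin on RDDs of key/value pairs; a full outer join was not in the API at the time of this thread. The semantics asked about can be sketched in plain Scala, with Seqs standing in for RDDs and made-up sample data, so no cluster is needed to try it:)

```scala
// Inner join: keep only keys present on both sides.
def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- left; (k2, w) <- right; if k == k2) yield (k, (v, w))

// Left outer join: keep every left key; missing right values become None.
def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
  val rightMap: Map[K, Seq[W]] = right.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
  left.flatMap { case (k, v) =>
    rightMap.get(k) match {
      case Some(ws) => ws.map(w => (k, (v, Some(w): Option[W])))
      case None     => Seq((k, (v, None: Option[W])))
    }
  }
}
```

(Note also, re question 3: the left value always comes first in the result tuple, because the join's result type is `(K, (V, W))`.)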
Sorry, I misunderstood your question in my prior email... I somehow thought you were talking about row order. Column order is preserved, yes. Take a look at the join method signatures and their types. And Spark isn't limited to "columns" but rather any user-defined type (that is serializable).

> On a side note, it appears that, as of now, Spark cannot be used as a replacement for Pig without some major coding. Agree?
>
> On Mon, Nov 18, 2013 at 10:47 PM, Horia <[email protected]> wrote:
>>
>> It seems to me that what you want is the following procedure:
>> - parse each file line by line
>> - generate key, value pairs
>> - join by key
>>
>> I think the following should accomplish what you're looking for:
>>
>> val students = sc.textFile("./students.txt") // mapping over this RDD already maps over lines
>> val courses = sc.textFile("./courses.txt")   // mapping over this RDD already maps over lines
>> val left = students.map { x =>
>>   val columns = x.split(",")
>>   (columns(1), (columns(0), columns(2)))
>> }
>> val right = courses.map { x =>
>>   val columns = x.split(",")
>>   (columns(0), columns(1))
>> }
>> val joined = left.join(right)
>>
>> The major difference is selectively returning the fields which you actually want to join, rather than all the fields. A secondary difference is syntactic: you don't need a .map().map() since you can use a slightly more complex function block as illustrated. I think Spark is smart enough to optimize the .map().map() to basically what I've explicitly written...
>>
>> Horia
>>
>> On Mon, Nov 18, 2013 at 10:34 PM, Something Something <[email protected]> wrote:
>>>
>>> Was my question so dumb? Or, is this not a good use case for Spark?
>>>
>>> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <[email protected]> wrote:
>>>>
>>>> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig for quite some time.
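(Horia's parse-then-join pipeline can be dry-run in plain Scala by substituting in-memory lines for sc.textFile; the sample rows below are made up, but the map bodies are the same as in the Spark snippet:)

```scala
// In-memory stand-ins for students.txt and courses.txt (hypothetical data).
val studentLines = Seq("s1,c1,90", "s2,c2,85")
val courseLines  = Seq("c1,Algebra", "c2,Biology")

// Key the students by course_id, keeping (student_id, score) as the value.
val left = studentLines.map { x =>
  val columns = x.split(",")
  (columns(1), (columns(0), columns(2)))
}
// Key the courses by course_id, keeping course_title as the value.
val right = courseLines.map { x =>
  val columns = x.split(",")
  (columns(0), columns(1))
}

// Simulate RDD.join: an inner join on the key.
val rightMap = right.toMap
val joined = left.flatMap { case (courseId, sv) =>
  rightMap.get(courseId).map(title => (courseId, (sv, title)))
}
```

(The result has the shape `(course_id, ((student_id, score), course_title))`, which is exactly what `left.join(right)` produces on the real RDDs.)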
>>>>
>>>> We have quite a few ETL processes running in production that use Pig, but now we're evaluating Spark to see if they would indeed run faster.
>>>>
>>>> A very common use case in our Pig scripts is joining a file containing Facts to a file containing Dimension data. The joins are, of course, inner, left & outer.
>>>>
>>>> I thought I would start simple. Let's say I have 2 files:
>>>>
>>>> 1) Students: student_id, course_id, score
>>>> 2) Course: course_id, course_title
>>>>
>>>> We want to produce a file that contains: student_id, course_title, score
>>>>
>>>> (Note: This is a hypothetical case. The real files have millions of facts & thousands of dimensions.)
>>>>
>>>> Would something like this work? Note: I did say I am a newbie ;)
>>>>
>>>> val students = sc.textFile("./students.txt")
>>>> val courses = sc.textFile("./courses.txt")
>>>> val s = students.map(x => x.split(','))
>>>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>>>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>>>> val joined = left.join(right)
>>>>
>>>> Any pointers in this regard would be greatly appreciated. Thanks.
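(One step the thread never spells out is going from the joined pairs to the requested "student_id, course_title, score" records. A plain-Scala sketch, using a made-up sample row in the shape the question's code produces, i.e. `(course_id, (studentRow, courseRow))` with the rows as Arrays from split:)

```scala
// Shape produced by left.join(right) in the question's snippet:
// (course_id, (studentRow, courseRow)), rows as Array[String] (sample data is hypothetical).
val joined = Seq(
  ("c1", (Array("s1", "c1", "90"), Array("c1", "Algebra")))
)

// Pick student_id and score from the left row, course_title from the right row.
val output = joined.map { case (_, (s, c)) => s"${s(0)},${c(1)},${s(2)}" }
```

(On a real RDD the same `.map` applies, followed by `saveAsTextFile` to write the result out.)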
