On Nov 20, 2013 8:34 AM, "Something Something" <[email protected]> wrote:
>
> Questions:
>
> 1) I don't see APIs for LEFT, FULL OUTER Joins. True?
The join operations are documented here:
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions

> 2) Apache Pig provides different join types such as 'replicated', 'skewed'. Now 'replicated' may not be a concern in Spark 'cause everything happens in memory (possibly).

Spark gives you the ability to plug in your own partitioner (for shuffles), which gives you several options for addressing data skew:
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.Partitioner

> 3) Does the 'join' (which seems to work like INNER Join) guarantee order? For example, can I assume that columns from the left side will appear before columns on the right & their order will be preserved?

I don't think Spark provides this guarantee as a matter of specification. The implementation may happen to behave that way, but only coincidentally. See also the pluggable partitioner above. Relying on it seems risky to me, but I'd be glad to see the behavior specified either way.

> On a side note, it appears, as of now, Spark cannot be used as a replacement for Pig - without some major coding. Agree?

I think once you wrap your head around Spark, you'll find it to be a more generally applicable framework, with the caveat that you may need to compose several operations/features together at times, whereas Pig may give you pre-defined higher-level operations as a single unit of abstraction. In other words, for some use cases Pig may afford you more convenience, but I believe Spark offers more expressive power and control over your computation.
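To make answers 1 and 3 concrete, here is a sketch of the join semantics using plain Scala collections (toy data invented for illustration). Spark's PairRDDFunctions expose the inner join as `join` and the left outer join as `leftOuterJoin`, with the missing right side represented as an `Option`. Note that within each joined tuple the left value does precede the right value, simply because the result type is (key, (leftValue, rightValue)); what is not guaranteed is the ordering of the *records* after the shuffle.

```scala
// Toy key-value pairs standing in for two pair RDDs (hypothetical data).
val left  = Seq(("c1", "alice"), ("c2", "bob"), ("c9", "carol"))
val right = Seq(("c1", "Algebra"), ("c2", "Biology"), ("c3", "Chemistry"))

// Inner join: only keys present on both sides survive.
val inner = for {
  (k, lv)  <- left
  (k2, rv) <- right
  if k == k2
} yield (k, (lv, rv))

// Left outer join: every left record survives; the right value becomes an Option.
val rightMap = right.toMap
val leftOuter = left.map { case (k, lv) => (k, (rightMap.get(k) match {
  case some => (lv, some)
})._1, rightMap.get(k)) }.map { case (k, lv, opt) => (k, (lv, opt)) }

println(inner)     // only keys c1 and c2 remain
println(leftOuter) // c9 pairs with None
```

A full outer join can be composed similarly by also keeping right-side keys absent from the left (in Spark, `cogroup` gives you the raw material for that).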
(And with the additional caveat that Spark is not as mature as Pig, so you may run into issues/limitations that the Spark developers are still working on or that are pending on the roadmap.)

> On Mon, Nov 18, 2013 at 10:47 PM, Horia <[email protected]> wrote:
>>
>> It seems to me that what you want is the following procedure:
>> - parse each file line by line
>> - generate key, value pairs
>> - join by key
>>
>> I think the following should accomplish what you're looking for:
>>
>> val students = sc.textFile("./students.txt") // mapping over this RDD already maps over lines
>> val courses = sc.textFile("./courses.txt")   // mapping over this RDD already maps over lines
>> val left = students.map( x => {
>>   val columns = x.split(",")
>>   (columns(1), (columns(0), columns(2)))
>> } )
>> val right = courses.map( x => {
>>   val columns = x.split(",")
>>   (columns(0), columns(1))
>> } )
>> val joined = left.join(right)
>>
>> The major difference is selectively returning the fields which you actually want to join, rather than all the fields. A secondary difference is syntactic: you don't need a .map().map(), since you can use a slightly more complex function block as illustrated. I think Spark is smart enough to optimize the .map().map() to basically what I've explicitly written...
>>
>> Horia
>>
>> On Mon, Nov 18, 2013 at 10:34 PM, Something Something <[email protected]> wrote:
>>>
>>> Was my question so dumb? Or, is this not a good use case for Spark?
>>>
>>> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <[email protected]> wrote:
>>>>
>>>> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig for quite some time.
>>>>
>>>> We've quite a few ETL processes running in production that use Pig, but now we're evaluating Spark to see if they would indeed run faster.
>>>>
>>>> A very common use case in our Pig scripts is joining a file containing Facts to a file containing Dimension data. The joins are, of course, inner, left & outer.
>>>>
>>>> I thought I would start simple. Let's say I've 2 files:
>>>>
>>>> 1) Students: student_id, course_id, score
>>>> 2) Course: course_id, course_title
>>>>
>>>> We want to produce a file that contains: student_id, course_title, score
>>>>
>>>> (Note: This is a hypothetical case. The real files have millions of facts & thousands of dimensions.)
>>>>
>>>> Would something like this work? Note: I did say I am a newbie ;)
>>>>
>>>> val students = sc.textFile("./students.txt")
>>>> val courses = sc.textFile("./courses.txt")
>>>> val s = students.map(x => x.split(','))
>>>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>>>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>>>> val joined = left.join(right)
>>>>
>>>> Any pointers in this regard would be greatly appreciated. Thanks.
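One gap in both snippets in the thread: they stop at `joined`, whose records have the shape (course_id, (student-side fields, course-side fields)), whereas the stated goal was a file of student_id, course_title, score. That takes one more map to project. A minimal sketch with plain Scala collections standing in for the RDDs (the file contents here are invented sample data; with Spark, `studentLines` and `courseLines` would come from `sc.textFile` and the same transformations apply):

```scala
// Lines as they might appear in the two files (hypothetical sample data).
val studentLines = Seq("s1,c1,85", "s2,c2,91")
val courseLines  = Seq("c1,Algebra", "c2,Biology")

// Key students by course_id, keeping (student_id, score).
val left = studentLines.map { line =>
  val cols = line.split(",")
  (cols(1), (cols(0), cols(2)))
}

// Key courses by course_id, keeping course_title.
val right = courseLines.map { line =>
  val cols = line.split(",")
  (cols(0), cols(1))
}.toMap

// Inner join on course_id, then project (student_id, course_title, score).
val result = for {
  (courseId, (studentId, score)) <- left
  title <- right.get(courseId)
} yield (studentId, title, score)

println(result) // List((s1,Algebra,85), (s2,Biology,91))
```

In Spark the final projection would be the analogous `joined.map { case (courseId, ((studentId, score), title)) => (studentId, title, score) }`.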

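Following up on the pluggable-partitioner point in answer 2: even without a custom Partitioner, skew can often be reduced by "salting" hot keys so that default hash partitioning spreads them across partitions, aggregating per salted key, then combining the partial results. A plain-Scala sketch of the idea, shown for an aggregation (data and salt count are hypothetical; in Spark the `groupBy`-and-sum steps would be `reduceByKey` calls):

```scala
import scala.util.Random

// Hypothetical skewed pairs: one hot key dominates.
val pairs = Seq.fill(6)(("hot", 1)) ++ Seq(("cold", 1))

val numSalts = 3
val rng = new Random(42)

// Spread each key across synthetic keys ("hot#0", "hot#1", ...), so a hash
// partitioner would send the hot key's records to several partitions.
val salted = pairs.map { case (k, v) => (s"$k#${rng.nextInt(numSalts)}", v) }

// Aggregate per salted key (the formerly skewed step)...
val partial = salted.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// ...then strip the salt and combine the partial sums.
val totals = partial.toSeq
  .map { case (k, v) => (k.split("#")(0), v) }
  .groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

println(totals) // hot -> 6, cold -> 1
```

For a skewed join (closer to Pig's 'skewed' join), the same salting trick works if the small side is replicated once per salt value.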