I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig
for quite some time.

We have quite a few ETL processes running in production that use Pig, but now
we're evaluating Spark to see if they would indeed run faster.

A very common use case in our Pig scripts is joining a file containing Facts
to a file containing Dimension data.  The joins are, of course, inner, left,
and outer.
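
From what I can tell from the docs, the pair-RDD API has join, leftOuterJoin,
and rightOuterJoin (and newer versions seem to have fullOuterJoin as well).
Here is a tiny made-up sketch of how I understand the three flavors; I may
well be off here:

val facts = sc.parallelize(Seq(("c1", "s1"), ("c1", "s2"), ("c9", "s3")))  // (course_id, student_id)
val dims  = sc.parallelize(Seq(("c1", "Algebra"), ("c2", "Biology")))      // (course_id, course_title)

facts.join(dims)            // inner: keeps only course_ids present on both sides ("c1")
facts.leftOuterJoin(dims)   // left:  keeps "c9" too; the dim side comes back as an Option (None here)
facts.rightOuterJoin(dims)  // right: keeps "c2" too; the fact side comes back as an Option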

I thought I would start simple.  Let's say I have two files:

1)  Students:  student_id, course_id, score
2)  Course:  course_id, course_title

We want to produce a file that contains:  student_id, course_title, score

(Note:  This is a hypothetical case.  The real files have millions of facts
& thousands of dimensions)

Would something like this work?  Note:  I did say I am a newbie ;)

// students.txt: student_id,course_id,score
val students = sc.textFile("./students.txt")
// courses.txt: course_id,course_title
val courses = sc.textFile("./courses.txt")
// Key each record by course_id so the join can match facts to dimensions
val left  = students.map(x => x.split(',')).map(y => (y(1), y))
val right = courses.map(x => x.split(',')).map(y => (y(0), y))
// Inner join on course_id => (course_id, (studentFields, courseFields))
val joined = left.join(right)
// Project down to (student_id, course_title, score)
val result = joined.map { case (_, (s, c)) => (s(0), c(1), s(2)) }
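
And for the left-join variant (keeping students whose course_id has no row in
courses.txt), I'm guessing it would be roughly the following; the "Unknown"
default is just something I made up:

// Left outer join: every student row survives; the course side is an Option.
val withTitles = left.leftOuterJoin(right).map { case (_, (s, cOpt)) =>
  (s(0), cOpt.map(_(1)).getOrElse("Unknown"), s(2))   // fall back to a made-up "Unknown" title
}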


Any pointers in this regard would be greatly appreciated.  Thanks.
