Yes, it would work and fits Spark nicely... Pretty typical, I think. (A rough sketch of the final projection step is below the quoted mail.)

On Nov 18, 2013 10:34 PM, "Something Something" <[email protected]> wrote:
> Was my question so dumb? Or, is this not a good use case for Spark?
>
>
> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <
> [email protected]> wrote:
>
>> I am a newbie to both Spark & Scala, but I've been working with
>> Hadoop/Pig for quite some time.
>>
>> We have quite a few ETL processes running in production that use Pig, but
>> now we're evaluating Spark to see if they would indeed run faster.
>>
>> A very common use case in our Pig scripts is joining a file containing
>> Facts to a file containing Dimension data. The joins are, of course,
>> inner, left & outer.
>>
>> I thought I would start simple. Let's say I have 2 files:
>>
>> 1) Students: student_id, course_id, score
>> 2) Courses: course_id, course_title
>>
>> We want to produce a file that contains: student_id, course_title, score
>>
>> (Note: This is a hypothetical case. The real files have millions of
>> facts & thousands of dimensions.)
>>
>> Would something like this work? Note: I did say I am a newbie ;)
>>
>> val students = sc.textFile("./students.txt")
>> val courses = sc.textFile("./courses.txt")
>> val s = students.map(x => x.split(','))
>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>> val joined = left.join(right)
>>
>> Any pointers in this regard would be greatly appreciated. Thanks.
>>
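For what it's worth, here is a minimal sketch of how the whole thing could look end to end, keying the facts by course_id and projecting the joined pairs down to (student_id, course_title, score). The SparkContext setup, file names, and output path are assumptions for illustration, not a prescription:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // implicit conversions for pair-RDD joins

    object StudentCourseJoin {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "StudentCourseJoin")

        // Key the fact file by course_id (field 1), keeping (student_id, score) as the value.
        val left = sc.textFile("./students.txt")
                     .map(_.split(','))
                     .map(s => (s(1), (s(0), s(2))))   // (course_id, (student_id, score))

        // Key the dimension file by course_id (field 0), keeping course_title as the value.
        val right = sc.textFile("./courses.txt")
                      .map(_.split(','))
                      .map(c => (c(0), c(1)))          // (course_id, course_title)

        // Inner join on course_id; use left.leftOuterJoin(right) for the left outer variant.
        val joined = left.join(right)                  // (course_id, ((student_id, score), course_title))

        // Project down to the requested output: student_id, course_title, score.
        val result = joined.map { case (_, ((studentId, score), courseTitle)) =>
          Seq(studentId, courseTitle, score).mkString(",")
        }

        result.saveAsTextFile("./student_course_scores")   // hypothetical output path
        sc.stop()
      }
    }

Note that leftOuterJoin returns the dimension side as an Option, so the projection step would need to handle the missing case (e.g. with getOrElse) for the outer variants.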
