It seems to me that what you want is the following procedure:
- parse each file line by line
- generate key, value pairs
- join by key

I think the following should accomplish what you're looking for:

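// Assumes `sc` is an existing SparkContext, as in spark-shell.
// textFile returns an RDD of lines, so mapping over it already maps over lines.
val students = sc.textFile("./students.txt")
val courses = sc.textFile("./courses.txt")

// students: student_id, course_id, score  ->  (course_id, (student_id, score))
val left = students.map( x => {
    val columns = x.split(",")
    (columns(1), (columns(0), columns(2)))
} )

// courses: course_id, course_title  ->  (course_id, course_title)
val right = courses.map( x => {
    val columns = x.split(",")
    (columns(0), columns(1))
} )

// Inner join on course_id
val joined = left.join(right)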
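
To get from the join to the student_id, course_title, score output you asked for, one more map is needed. An inner join on pair RDDs yields elements of the form (key, (leftValue, rightValue)), so the projection is a pattern match on that shape. Here is a minimal sketch that runs on plain Scala collections rather than on a cluster; the sample rows, the object name and the for-comprehension standing in for join are my own assumptions for illustration, not Spark code.

```scala
// Sketch only: plain Scala collections stand in for RDDs, and the sample
// rows below are made up for illustration.
object JoinShapeSketch {
  // Keyed the same way as `left` and `right` above: course_id is the key.
  val left: Seq[(String, (String, String))] = Seq(
    ("c1", ("s1", "90")),
    ("c2", ("s2", "75")),
    ("c9", ("s3", "60")) // no matching course, so an inner join drops it
  )
  val right: Seq[(String, String)] = Seq(("c1", "Algebra"), ("c2", "Biology"))

  // Local stand-in for an inner join: (key, (leftValue, rightValue)).
  val joined: Seq[(String, ((String, String), String))] =
    for {
      (k, lv) <- left
      (k2, rv) <- right
      if k == k2
    } yield (k, (lv, rv))

  // Project each joined element to (student_id, course_title, score).
  val result: Seq[(String, String, String)] =
    joined.map { case (_, ((studentId, score), title)) => (studentId, title, score) }

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

On the real RDD the same `map { case (_, ((studentId, score), title)) => ... }` should apply to `left.join(right)` unchanged, though the order of the output records is not guaranteed there.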


The major difference is that each map returns only the fields you actually
need after the join, rather than the whole array of columns. A secondary
difference is syntactic: you don't need a .map().map(), since a single map
can take a slightly more complex function block, as illustrated. I think
Spark is smart enough to optimize the .map().map() into basically what I've
written explicitly...
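
To make the syntactic point concrete, here is a small sketch comparing the two forms on a plain Scala Seq. The sample rows and the object name are assumptions of mine, and the chained version is adapted to return the selected fields (rather than the whole array, as in your original) so the two results can be compared directly.

```scala
// Sketch only: a plain Seq stands in for the RDD of lines, and the sample
// rows are made up for illustration.
object MapChainSketch {
  val lines: Seq[String] = Seq("s1,c1,90", "s2,c2,75")

  // Two chained maps: split first, then build the key/value pair.
  val chained: Seq[(String, (String, String))] =
    lines.map(x => x.split(',')).map(y => (y(1), (y(0), y(2))))

  // One map with a function block doing both steps.
  val single: Seq[(String, (String, String))] =
    lines.map { x =>
      val columns = x.split(",")
      (columns(1), (columns(0), columns(2)))
    }

  def main(args: Array[String]): Unit =
    println(chained == single)
}
```

Both forms produce the same pairs. As far as I know Spark pipelines consecutive map calls within one stage, so on an RDD the choice is mostly a matter of readability.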

Horia



On Mon, Nov 18, 2013 at 10:34 PM, Something Something <
[email protected]> wrote:

> Was my question so dumb?  Or, is this not a good use case for Spark?
>
>
> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <
> [email protected]> wrote:
>
>> I am a newbie to both Spark & Scala, but I've been working with
>> Hadoop/Pig for quite some time.
>>
>> We've quite a few ETL processes running in production that use Pig, but
>> now we're evaluating Spark to see if they would indeed run faster.
>>
>> A very common use case in our Pig script is joining a file containing
>> Facts to a file containing Dimension data.  The joins are of course, inner,
>> left & outer.
>>
>> I thought I would start simple.  Let's say I've 2 files:
>>
>> 1)  Students:  student_id, course_id, score
>> 2)  Course:  course_id, course_title
>>
>> We want to produce a file that contains:  student_id, course_title, score
>>
>> (Note:  This is a hypothetical case.  The real files have millions of
>> facts & thousands of dimensions)
>>
>> Would something like this work?  Note:  I did say I am a newbie ;)
>>
>> val students = sc.textFile("./students.txt")
>> val courses = sc.textFile("./courses.txt")
>> val s = students.map(x => x.split(','))
>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>> val joined = left.join(right)
>>
>>
>> Any pointers in this regard would be greatly appreciated.  Thanks.
>>
>
>
