On Nov 20, 2013 8:34 AM, "Something Something" <[email protected]> wrote:
>
> Questions:
>
> 1) I don't see APIs for LEFT, FULL OUTER Joins. True?
>
> 2) Apache Pig provides different join types such as 'replicated' and 'skewed'. Now 'replicated' may not be a concern in Spark 'cause everything (possibly) happens in memory.
>
> 3) Does the 'join' (which seems to work like an INNER Join) guarantee order? For example, can I assume that columns from the left side will appear before columns from the right side, and that their order will be preserved?
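(On question 1: Spark's PairRDDFunctions does expose leftOuterJoin and rightOuterJoin on RDDs of key/value pairs; a full outer join was not in the API at the time of this thread. The semantics asked about can be sketched in plain Scala, with Seqs standing in for RDDs and made-up sample data, so no cluster is needed to try it:)

```scala
// Inner join: keep only keys present on both sides.
def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
  for ((k, v) <- left; (k2, w) <- right; if k == k2) yield (k, (v, w))

// Left outer join: keep every left key; missing right values become None.
def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
  val rightMap: Map[K, Seq[W]] = right.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }
  left.flatMap { case (k, v) =>
    rightMap.get(k) match {
      case Some(ws) => ws.map(w => (k, (v, Some(w): Option[W])))
      case None     => Seq((k, (v, None: Option[W])))
    }
  }
}
```

(Note also, re question 3: the left value always comes first in the result tuple, because the join's result type is `(K, (V, W))`.)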
Sorry, I misunderstood your question in my prior email... I somehow thought you were talking about row order. Column order is preserved, yes. Take a look at the join method signatures and their types. And Spark isn't limited to "columns" but rather any user-defined type (that is serializable).

> On a side note, it appears that, as of now, Spark cannot be used as a replacement for Pig without some major coding. Agree?
>
> On Mon, Nov 18, 2013 at 10:47 PM, Horia <[email protected]> wrote:
>>
>> It seems to me that what you want is the following procedure:
>> - parse each file line by line
>> - generate key, value pairs
>> - join by key
>>
>> I think the following should accomplish what you're looking for:
>>
>> val students = sc.textFile("./students.txt") // mapping over this RDD already maps over lines
>> val courses = sc.textFile("./courses.txt")   // mapping over this RDD already maps over lines
>> val left = students.map { x =>
>>   val columns = x.split(",")
>>   (columns(1), (columns(0), columns(2)))
>> }
>> val right = courses.map { x =>
>>   val columns = x.split(",")
>>   (columns(0), columns(1))
>> }
>> val joined = left.join(right)
>>
>> The major difference is selectively returning the fields which you actually want to join, rather than all the fields. A secondary difference is syntactic: you don't need a .map().map() since you can use a slightly more complex function block as illustrated. I think Spark is smart enough to optimize the .map().map() to basically what I've explicitly written...
>>
>> Horia
>>
>> On Mon, Nov 18, 2013 at 10:34 PM, Something Something <[email protected]> wrote:
>>>
>>> Was my question so dumb? Or, is this not a good use case for Spark?
>>>
>>> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <[email protected]> wrote:
>>>>
>>>> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig for quite some time.
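(Horia's parse-then-join pipeline can be dry-run in plain Scala by substituting in-memory lines for sc.textFile; the sample rows below are made up, but the map bodies are the same as in the Spark snippet:)

```scala
// In-memory stand-ins for students.txt and courses.txt (hypothetical data).
val studentLines = Seq("s1,c1,90", "s2,c2,85")
val courseLines  = Seq("c1,Algebra", "c2,Biology")

// Key the students by course_id, keeping (student_id, score) as the value.
val left = studentLines.map { x =>
  val columns = x.split(",")
  (columns(1), (columns(0), columns(2)))
}
// Key the courses by course_id, keeping course_title as the value.
val right = courseLines.map { x =>
  val columns = x.split(",")
  (columns(0), columns(1))
}

// Simulate RDD.join: an inner join on the key.
val rightMap = right.toMap
val joined = left.flatMap { case (courseId, sv) =>
  rightMap.get(courseId).map(title => (courseId, (sv, title)))
}
```

(The result has the shape `(course_id, ((student_id, score), course_title))`, which is exactly what `left.join(right)` produces on the real RDDs.)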
>>>>
>>>> We have quite a few ETL processes running in production that use Pig, but now we're evaluating Spark to see if they would indeed run faster.
>>>>
>>>> A very common use case in our Pig scripts is joining a file containing Facts to a file containing Dimension data. The joins are, of course, inner, left & outer.
>>>>
>>>> I thought I would start simple. Let's say I have 2 files:
>>>>
>>>> 1) Students: student_id, course_id, score
>>>> 2) Course: course_id, course_title
>>>>
>>>> We want to produce a file that contains: student_id, course_title, score
>>>>
>>>> (Note: This is a hypothetical case. The real files have millions of facts & thousands of dimensions.)
>>>>
>>>> Would something like this work? Note: I did say I am a newbie ;)
>>>>
>>>> val students = sc.textFile("./students.txt")
>>>> val courses = sc.textFile("./courses.txt")
>>>> val s = students.map(x => x.split(','))
>>>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>>>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>>>> val joined = left.join(right)
>>>>
>>>> Any pointers in this regard would be greatly appreciated. Thanks.
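(One step the thread never spells out is going from the joined pairs to the requested "student_id, course_title, score" records. A plain-Scala sketch, using a made-up sample row in the shape the question's code produces, i.e. `(course_id, (studentRow, courseRow))` with the rows as Arrays from split:)

```scala
// Shape produced by left.join(right) in the question's snippet:
// (course_id, (studentRow, courseRow)), rows as Array[String] (sample data is hypothetical).
val joined = Seq(
  ("c1", (Array("s1", "c1", "90"), Array("c1", "Algebra")))
)

// Pick student_id and score from the left row, course_title from the right row.
val output = joined.map { case (_, (s, c)) => s"${s(0)},${c(1)},${s(2)}" }
```

(On a real RDD the same `.map` applies, followed by `saveAsTextFile` to write the result out.)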
