On Nov 20, 2013 8:34 AM, "Something Something" <[email protected]> wrote:
>
> Questions:
>
> 1) I don't see APIs for LEFT, FULL OUTER Joins. True?
The join operations are documented here:
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions

> 2) Apache Pig provides different join types such as 'replicated', 'skewed'. Now 'replicated' may not be a concern in Spark 'cause everything happens in memory (possibly).

Spark gives you the ability to plug in your own partitioner (for shuffles), which gives you several options for addressing data skew:
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.Partitioner

> 3) Does the 'join' (which seems to work like INNER Join) guarantee order? For example, can I assume that columns from the left side will appear before columns on the right & their order will be preserved?

I don't think Spark provides this guarantee as a matter of specification. The implementation may happen to behave that way, but only coincidentally. See also the pluggable partitioner above. Relying on it seems risky to me, but I'd be glad to see the behavior specified either way.

> On a side note, it appears, as of now, Spark cannot be used as a replacement for Pig - without some major coding. Agree?

I think once you wrap your head around Spark, you'll find it to be a more generally applicable framework, with the caveat that you may need to compose several operations/features together at times, whereas Pig may give you pre-defined higher-level operations as a single unit of abstraction. In other words, for some use cases Pig may afford you more convenience, but I believe Spark offers more expressive power and control over your computation.
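To make answers 1 and 3 concrete, here is a sketch of the join semantics using plain Scala collections (toy data invented for illustration). Spark's PairRDDFunctions expose the inner join as `join` and the left outer join as `leftOuterJoin`, with the missing right side represented as an `Option`. Note that within each joined tuple the left value does precede the right value, simply because the result type is (key, (leftValue, rightValue)); what is not guaranteed is the ordering of the *records* after the shuffle.

```scala
// Toy key-value pairs standing in for two pair RDDs (hypothetical data).
val left  = Seq(("c1", "alice"), ("c2", "bob"), ("c9", "carol"))
val right = Seq(("c1", "Algebra"), ("c2", "Biology"), ("c3", "Chemistry"))

// Inner join: only keys present on both sides survive.
val inner = for {
  (k, lv)  <- left
  (k2, rv) <- right
  if k == k2
} yield (k, (lv, rv))

// Left outer join: every left record survives; the right value becomes an Option.
val rightMap = right.toMap
val leftOuter = left.map { case (k, lv) => (k, (rightMap.get(k) match {
  case some => (lv, some)
})._1, rightMap.get(k)) }.map { case (k, lv, opt) => (k, (lv, opt)) }

println(inner)     // only keys c1 and c2 remain
println(leftOuter) // c9 pairs with None
```

A full outer join can be composed similarly by also keeping right-side keys absent from the left (in Spark, `cogroup` gives you the raw material for that).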
(And with the additional caveat that Spark is not as mature as Pig, so you may run into issues/limitations that the Spark developers are still working on or that are pending on the roadmap.)

> On Mon, Nov 18, 2013 at 10:47 PM, Horia <[email protected]> wrote:
>>
>> It seems to me that what you want is the following procedure:
>> - parse each file line by line
>> - generate key, value pairs
>> - join by key
>>
>> I think the following should accomplish what you're looking for:
>>
>> val students = sc.textFile("./students.txt") // mapping over this RDD already maps over lines
>> val courses = sc.textFile("./courses.txt")   // mapping over this RDD already maps over lines
>> val left = students.map( x => {
>>   val columns = x.split(",")
>>   (columns(1), (columns(0), columns(2)))
>> } )
>> val right = courses.map( x => {
>>   val columns = x.split(",")
>>   (columns(0), columns(1))
>> } )
>> val joined = left.join(right)
>>
>> The major difference is selectively returning the fields which you actually want to join, rather than all the fields. A secondary difference is syntactic: you don't need a .map().map(), since you can use a slightly more complex function block as illustrated. I think Spark is smart enough to optimize the .map().map() to basically what I've explicitly written...
>>
>> Horia
>>
>> On Mon, Nov 18, 2013 at 10:34 PM, Something Something <[email protected]> wrote:
>>>
>>> Was my question so dumb? Or, is this not a good use case for Spark?
>>>
>>> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <[email protected]> wrote:
>>>>
>>>> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig for quite some time.
>>>>
>>>> We've quite a few ETL processes running in production that use Pig, but now we're evaluating Spark to see if they would indeed run faster.
>>>>
>>>> A very common use case in our Pig scripts is joining a file containing Facts to a file containing Dimension data. The joins are, of course, inner, left & outer.
>>>>
>>>> I thought I would start simple. Let's say I've 2 files:
>>>>
>>>> 1) Students: student_id, course_id, score
>>>> 2) Course: course_id, course_title
>>>>
>>>> We want to produce a file that contains: student_id, course_title, score
>>>>
>>>> (Note: This is a hypothetical case. The real files have millions of facts & thousands of dimensions.)
>>>>
>>>> Would something like this work? Note: I did say I am a newbie ;)
>>>>
>>>> val students = sc.textFile("./students.txt")
>>>> val courses = sc.textFile("./courses.txt")
>>>> val s = students.map(x => x.split(','))
>>>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>>>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>>>> val joined = left.join(right)
>>>>
>>>> Any pointers in this regard would be greatly appreciated. Thanks.
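One gap in both snippets in the thread: they stop at `joined`, whose records have the shape (course_id, (student-side fields, course-side fields)), whereas the stated goal was a file of student_id, course_title, score. That takes one more map to project. A minimal sketch with plain Scala collections standing in for the RDDs (the file contents here are invented sample data; with Spark, `studentLines` and `courseLines` would come from `sc.textFile` and the same transformations apply):

```scala
// Lines as they might appear in the two files (hypothetical sample data).
val studentLines = Seq("s1,c1,85", "s2,c2,91")
val courseLines  = Seq("c1,Algebra", "c2,Biology")

// Key students by course_id, keeping (student_id, score).
val left = studentLines.map { line =>
  val cols = line.split(",")
  (cols(1), (cols(0), cols(2)))
}

// Key courses by course_id, keeping course_title.
val right = courseLines.map { line =>
  val cols = line.split(",")
  (cols(0), cols(1))
}.toMap

// Inner join on course_id, then project (student_id, course_title, score).
val result = for {
  (courseId, (studentId, score)) <- left
  title <- right.get(courseId)
} yield (studentId, title, score)

println(result) // List((s1,Algebra,85), (s2,Biology,91))
```

In Spark the final projection would be the analogous `joined.map { case (courseId, ((studentId, score), title)) => (studentId, title, score) }`.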

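Following up on the pluggable-partitioner point in answer 2: even without a custom Partitioner, skew can often be reduced by "salting" hot keys so that default hash partitioning spreads them across partitions, aggregating per salted key, then combining the partial results. A plain-Scala sketch of the idea, shown for an aggregation (data and salt count are hypothetical; in Spark the `groupBy`-and-sum steps would be `reduceByKey` calls):

```scala
import scala.util.Random

// Hypothetical skewed pairs: one hot key dominates.
val pairs = Seq.fill(6)(("hot", 1)) ++ Seq(("cold", 1))

val numSalts = 3
val rng = new Random(42)

// Spread each key across synthetic keys ("hot#0", "hot#1", ...), so a hash
// partitioner would send the hot key's records to several partitions.
val salted = pairs.map { case (k, v) => (s"$k#${rng.nextInt(numSalts)}", v) }

// Aggregate per salted key (the formerly skewed step)...
val partial = salted.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// ...then strip the salt and combine the partial sums.
val totals = partial.toSeq
  .map { case (k, v) => (k.split("#")(0), v) }
  .groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

println(totals) // hot -> 6, cold -> 1
```

For a skewed join (closer to Pig's 'skewed' join), the same salting trick works if the small side is replicated once per salt value.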