Hi,
FYI: There is a bug in Java mapPartitions - SPARK-3369
<https://issues.apache.org/jira/browse/SPARK-3369>. In Java, results from
mapPartitions and similar functions must fit in memory. Look at the example
below - it returns a List.
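A minimal sketch of that pattern, assuming the Spark 1.x Java API where
FlatMapFunction.call() returns an Iterable (the RDD and the line-length
logic here are only illustrative):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

public class MapPartitionsSketch {
  public static JavaRDD<Integer> lineLengths(JavaRDD<String> lines) {
    return lines.mapPartitions(new FlatMapFunction<Iterator<String>, Integer>() {
      @Override
      public Iterable<Integer> call(Iterator<String> it) {
        // call() must return an Iterable, so the obvious implementation
        // collects the whole partition's output into a List first.
        List<Integer> out = new ArrayList<Integer>();
        while (it.hasNext()) {
          out.add(it.next().length());
        }
        return out; // the entire partition's output is held in memory
      }
    });
  }
}

A lazily evaluated Iterable is possible but awkward to write, so the shape
of the API pushes users toward materializing a collection, which is the
inconsistency the JIRA describes.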
Lukas
On 1.9.2014 00:50, Matei Zaharia wrote:
mapPartitions just gives you an Iterator of the values in each
partition, and lets you return an Iterator of outputs. For instance,
take a look at
https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java#L694.
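Roughly, the pattern that test exercises looks like this (a sketch, not
the test verbatim; it assumes an existing JavaSparkContext named sc):

import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// Sum the values in each partition and emit one result per partition.
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4), 2);
JavaRDD<Integer> partitionSums = rdd.mapPartitions(
    new FlatMapFunction<Iterator<Integer>, Integer>() {
      @Override
      public Iterable<Integer> call(Iterator<Integer> iter) {
        int sum = 0;
        while (iter.hasNext()) {
          sum += iter.next();
        }
        // One small output per partition - nothing large is materialized.
        return Collections.singletonList(sum);
      }
    });
// With two partitions [1, 2] and [3, 4], collect() returns [3, 7].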
Matei
On August 31, 2014 at 12:26:51 PM, Steve Lewis (lordjoe2...@gmail.com) wrote:
Is there a sample of how to do this?
I see 1.1 is out but cannot find samples of mapPartitions.
A Java sample would be very useful.
On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia
<matei.zaha...@gmail.com> wrote:
In 1.1, you'll be able to get all of these properties using
sortByKey, and then mapPartitions on top to iterate through the
key-value pairs. Unfortunately sortByKey does not let you control
the Partitioner, but it's fairly easy to write your own version
that does if this is important.
In previous versions, the values for each key had to fit in
memory (though we could have data on disk across keys), and this
is still true for groupByKey, cogroup and join. Those
restrictions will hopefully go away in a later release. But
sortByKey + mapPartitions lets you just iterate through the
key-value pairs without worrying about this.
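A minimal sketch of that recipe, assuming the 1.1 Java API (the pair RDD,
the String keys, and the sum-per-key summary are made up for
illustration). After sortByKey, all pairs for a given key sit
consecutively within one partition, so mapPartitions can detect key
boundaries and fold each group as a stream, never holding a whole group
in memory:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import scala.Tuple2;

public class SortedGroups {
  public static JavaRDD<String> summarizeGroups(JavaPairRDD<String, Long> pairs) {
    return pairs.sortByKey().mapPartitions(
        new FlatMapFunction<Iterator<Tuple2<String, Long>>, String>() {
          @Override
          public Iterable<String> call(Iterator<Tuple2<String, Long>> it) {
            List<String> summaries = new ArrayList<String>();
            String currentKey = null;
            long sum = 0;
            while (it.hasNext()) {
              Tuple2<String, Long> kv = it.next();
              if (currentKey != null && !currentKey.equals(kv._1())) {
                // Key changed: the previous group is complete, emit it.
                summaries.add(currentKey + ":" + sum);
                sum = 0;
              }
              currentKey = kv._1();
              sum += kv._2();
            }
            if (currentKey != null) {
              summaries.add(currentKey + ":" + sum); // last group
            }
            return summaries;
          }
        });
  }
}

Each group contributes one small summary record, so the List returned
from call() stays small even when the groups themselves do not.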
Matei
On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:
When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the
same machine (thread)
2) All keys received by a specific machine (thread) will be
received in sorted order
3) These conditions will hold even if the values associated with
a specific key are too large to fit in memory.
In my Hadoop code I use all of these conditions - specifically,
with my larger data sets the size of the data I wish to group
exceeds the available memory.
I think I understand the operation of groupBy, but my
understanding is that it requires that the results for a
single key, and perhaps all keys, fit on a single machine.
Is there a way to perform grouping like Hadoop does without
requiring that an entire group fit in memory?
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com