Spark creates a RecordReader and calls next() on it each time you call input.next(); see https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215. How the RecordReader itself reads and buffers data is an HDFS question, but it's safe to say there is no difference between using map and mapPartitions: both consume the partition's records one at a time through the same iterator.
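To illustrate that point with a minimal, Spark-free sketch (plain Java; CountingSource is a hypothetical stand-in for a Hadoop RecordReader, not Spark's actual class): wrapping an iterator the way mapPartitions does only pulls records from the underlying source as they are consumed, so the whole partition never has to sit in memory unless your function buffers it itself.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LazyIteratorDemo {
    // Hypothetical stand-in for a RecordReader: counts how many records
    // have actually been read from the underlying source.
    static class CountingSource implements Iterator<String> {
        private final Iterator<String> underlying;
        int recordsRead = 0;

        CountingSource(Iterator<String> underlying) {
            this.underlying = underlying;
        }

        public boolean hasNext() { return underlying.hasNext(); }

        public String next() {
            recordsRead++;
            return underlying.next();
        }
    }

    // Lazy "map to line length": wraps the source iterator and computes
    // each length on demand, the way map/mapPartitions consume a
    // partition's iterator without materializing it.
    static Iterator<Integer> lengths(final Iterator<String> input) {
        return new Iterator<Integer>() {
            public boolean hasNext() { return input.hasNext(); }
            public Integer next() { return input.next().length(); }
        };
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bb", "ccc", "dddd", "eeeee");
        CountingSource source = new CountingSource(lines.iterator());
        Iterator<Integer> out = lengths(source);

        // Consume only two results: only two records are pulled from the
        // source, even though the "partition" has five lines.
        out.next();
        out.next();
        System.out.println("recordsRead=" + source.recordsRead); // prints recordsRead=2
    }
}
```

If the function instead copies every record into an ArrayList (as in the quoted code below), the whole partition does end up in memory at once; the laziness only holds as long as you stay on the iterator.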
On Thu, Jun 25, 2015 at 1:28 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
> Say the source is HDFS, and the file is divided into 10 partitions. What
> will the input contain?
>
> public Iterable<Integer> call(Iterator<String> input)
>
> Say I have 10 executors in the job, each having a single partition.
>
> Will it have some part of the partition, or the complete partition? And if
> only some part, when I call input.next(), will it fetch the rest? How is
> this handled?
>
> On Thu, Jun 25, 2015 at 3:11 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> No, or at least, it depends on how the source of the partitions was
>> implemented.
>>
>> On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora
>> <shushantaror...@gmail.com> wrote:
>> > Does mapPartitions keep the complete partition in the executor's memory
>> > as an Iterable?
>> >
>> > JavaRDD<String> rdd = jsc.textFile("path");
>> > JavaRDD<Integer> output = rdd.mapPartitions(new
>> > FlatMapFunction<Iterator<String>, Integer>() {
>> >
>> >     public Iterable<Integer> call(Iterator<String> input)
>> >             throws Exception {
>> >         List<Integer> output = new ArrayList<Integer>();
>> >         while (input.hasNext()) {
>> >             output.add(input.next().length());
>> >         }
>> >         return output;
>> >     }
>> >
>> > });
>> >
>> > Is the input here present in memory, and can it contain a complete
>> > partition of GBs?
>> > Will this function call(Iterator<String> input) be called only as many
>> > times as the number of partitions (say 10 in this example), not as many
>> > times as the number of lines (say 10000000)?
>> >
>> > And what's the use of mapPartitionsWithIndex?
>> >
>> > Thanks
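The quoted question about mapPartitionsWithIndex goes unanswered above. What it adds over mapPartitions is simply the partition's index, passed to your function alongside the iterator, which is useful for things like tagging records with their partition of origin or skipping a header line only in partition 0. A Spark-free sketch of the idea (plain Java; the partition data and the tagWithIndex helper are hypothetical, standing in for what your FlatMapFunction would do per partition):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionIndexDemo {
    // Hypothetical helper mimicking mapPartitionsWithIndex: the function
    // body sees both the partition index and that partition's records.
    static List<String> tagWithIndex(List<List<String>> partitions) {
        List<String> tagged = new ArrayList<String>();
        for (int index = 0; index < partitions.size(); index++) {
            // In Spark, this inner loop would consume the partition's
            // Iterator<String> on one executor; here we just iterate.
            for (String record : partitions.get(index)) {
                tagged.add(index + ":" + record);
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        // Simulated RDD with 3 partitions (in Spark these would live on
        // executors, not in one local list).
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "bb"),
                Arrays.asList("ccc"),
                Arrays.asList("dddd", "eeeee", "f"));

        System.out.println(tagWithIndex(partitions));
        // prints [0:a, 0:bb, 1:ccc, 2:dddd, 2:eeeee, 2:f]
    }
}
```

With plain mapPartitions the function would receive only the iterator and could not tell which partition it was processing.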