Spark creates a RecordReader and calls next() on it each time you call input.next(); see https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215. How the RecordReader itself reads and buffers data is an HDFS question, but it's safe to say there is no difference between using map and mapPartitions: both consume the partition's records one at a time through the same iterator.
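To illustrate that point with a minimal, Spark-free sketch (plain Java; CountingSource is a hypothetical stand-in for a Hadoop RecordReader, not Spark's actual class): wrapping an iterator the way mapPartitions does only pulls records from the underlying source as they are consumed, so the whole partition never has to sit in memory unless your function buffers it itself.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LazyIteratorDemo {
    // Hypothetical stand-in for a RecordReader: counts how many records
    // have actually been read from the underlying source.
    static class CountingSource implements Iterator<String> {
        private final Iterator<String> underlying;
        int recordsRead = 0;

        CountingSource(Iterator<String> underlying) {
            this.underlying = underlying;
        }

        public boolean hasNext() { return underlying.hasNext(); }

        public String next() {
            recordsRead++;
            return underlying.next();
        }
    }

    // Lazy "map to line length": wraps the source iterator and computes
    // each length on demand, the way map/mapPartitions consume a
    // partition's iterator without materializing it.
    static Iterator<Integer> lengths(final Iterator<String> input) {
        return new Iterator<Integer>() {
            public boolean hasNext() { return input.hasNext(); }
            public Integer next() { return input.next().length(); }
        };
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "bb", "ccc", "dddd", "eeeee");
        CountingSource source = new CountingSource(lines.iterator());
        Iterator<Integer> out = lengths(source);

        // Consume only two results: only two records are pulled from the
        // source, even though the "partition" has five lines.
        out.next();
        out.next();
        System.out.println("recordsRead=" + source.recordsRead); // prints recordsRead=2
    }
}
```

If the function instead copies every record into an ArrayList (as in the quoted code below), the whole partition does end up in memory at once; the laziness only holds as long as you stay on the iterator.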
On Thu, Jun 25, 2015 at 1:28 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
> Say the source is HDFS, and the file is divided into 10 partitions. What
> will the input contain?
>
> public Iterable<Integer> call(Iterator<String> input)
>
> Say I have 10 executors in the job, each having a single partition.
>
> Will it have some part of the partition, or the complete partition? And if
> only some part, when I call input.next(), will it fetch the rest? How is
> this handled?
>
> On Thu, Jun 25, 2015 at 3:11 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> No, or at least, it depends on how the source of the partitions was
>> implemented.
>>
>> On Thu, Jun 25, 2015 at 12:16 PM, Shushant Arora
>> <shushantaror...@gmail.com> wrote:
>> > Does mapPartitions keep the complete partition in the executor's memory
>> > as an Iterable?
>> >
>> > JavaRDD<String> rdd = jsc.textFile("path");
>> > JavaRDD<Integer> output = rdd.mapPartitions(new
>> > FlatMapFunction<Iterator<String>, Integer>() {
>> >
>> >     public Iterable<Integer> call(Iterator<String> input)
>> >             throws Exception {
>> >         List<Integer> output = new ArrayList<Integer>();
>> >         while (input.hasNext()) {
>> >             output.add(input.next().length());
>> >         }
>> >         return output;
>> >     }
>> >
>> > });
>> >
>> > Is the input here present in memory, and can it contain a complete
>> > partition of GBs?
>> > Will this function call(Iterator<String> input) be called only as many
>> > times as the number of partitions (say 10 in this example), not as many
>> > times as the number of lines (say 10000000)?
>> >
>> > And what's the use of mapPartitionsWithIndex?
>> >
>> > Thanks
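The quoted question about mapPartitionsWithIndex goes unanswered above. What it adds over mapPartitions is simply the partition's index, passed to your function alongside the iterator, which is useful for things like tagging records with their partition of origin or skipping a header line only in partition 0. A Spark-free sketch of the idea (plain Java; the partition data and the tagWithIndex helper are hypothetical, standing in for what your FlatMapFunction would do per partition):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionIndexDemo {
    // Hypothetical helper mimicking mapPartitionsWithIndex: the function
    // body sees both the partition index and that partition's records.
    static List<String> tagWithIndex(List<List<String>> partitions) {
        List<String> tagged = new ArrayList<String>();
        for (int index = 0; index < partitions.size(); index++) {
            // In Spark, this inner loop would consume the partition's
            // Iterator<String> on one executor; here we just iterate.
            for (String record : partitions.get(index)) {
                tagged.add(index + ":" + record);
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        // Simulated RDD with 3 partitions (in Spark these would live on
        // executors, not in one local list).
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "bb"),
                Arrays.asList("ccc"),
                Arrays.asList("dddd", "eeeee", "f"));

        System.out.println(tagWithIndex(partitions));
        // prints [0:a, 0:bb, 1:ccc, 2:dddd, 2:eeeee, 2:f]
    }
}
```

With plain mapPartitions the function would receive only the iterator and could not tell which partition it was processing.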