At the end of a set of computations I have a JavaRDD<String>. I want a
single file where each string is printed in order. The data is small enough
that it is acceptable to handle the printout on a single processor, but it
may be large enough that using collect to generate a list is unacceptable.
The saveAsTextFile command creates multiple files with names like
part-00000, part-00001, .... This was bad behavior in Hadoop for final
output and is also bad for Spark.
  A more general issue is whether it is possible to convert a JavaRDD into
an iterator or iterable over the entire data set without using collect or
holding all of the data in memory.
   This could be very useful in the many problems where it is desirable to
parallelize the intermediate steps but use a single process to handle the
final result.
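
One possible approach, sketched under assumptions: in Spark versions whose
JavaRDD exposes toLocalIterator(), that call returns a java.util.Iterator
that brings partitions to the driver one at a time, so only a single
partition needs to fit in memory rather than the whole data set. The helper
below is a hypothetical illustration (the class name SingleFileWriter and
the method writeLines are not part of Spark); it uses only the Java standard
library and streams any Iterator<String> into one file in iteration order.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

// Hypothetical helper, not a Spark class: writes every element of an
// iterator to a single text file, one element per line, in iteration order.
public class SingleFileWriter {

    // Consumes the iterator lazily, so memory use is bounded by whatever
    // the iterator itself buffers (for an RDD iterator, about one partition).
    // Returns the number of lines written.
    public static long writeLines(Iterator<String> lines, Path target)
            throws IOException {
        long count = 0;
        try (BufferedWriter out =
                 Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (lines.hasNext()) {
                out.write(lines.next());
                out.newLine();
                count++;
            }
        }
        return count;
    }
}
```

With an RDD named rdd (an assumed variable), the call would look roughly
like SingleFileWriter.writeLines(rdd.toLocalIterator(), Paths.get("out.txt")).
Partitions are visited in partition order, so a sorted RDD should come out in
sorted order. Each partition is fetched by its own job, so caching the RDD
first may avoid recomputing its lineage for every partition.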
