I have an RDD that is potentially too large to fit in memory with
collect(). I want a single task to write its contents to a file on HDFS.
Time is not a major concern, but memory is.
I use the following code, which converts my RDD (scans) to a local
Iterator. This works, but hasNext shows up as a separate task and takes on
the order of 20 seconds for a medium-sized job.
*Is toLocalIterator a bad function to call in this case, and is there a
better one?*

    public void writeScores(final Appendable out, JavaRDD<IScoredScan> scans) {
        writer.appendHeader(out, getApplication());
        Iterator<IScoredScan> scanIterator = scans.toLocalIterator();
        while (scanIterator.hasNext()) {
            IScoredScan scan = scanIterator.next();
            writer.appendScan(out, getApplication(), scan);
        }
        writer.appendFooter(out, getApplication());
    }
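
For context, my understanding is that toLocalIterator brings one partition at a time to the driver, so memory use is bounded by the largest partition, and the fetch for the next partition is triggered from hasNext. That would explain why hasNext shows up as its own task. The sketch below is a plain-Java model of that behaviour only, not Spark's implementation: the class name PartitionAtATimeIterator, the use of Supplier to stand in for "run a job for this partition", and the fetches counter are all hypothetical names introduced for illustration.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical model of partition-at-a-time iteration; this is NOT Spark code.
class PartitionAtATimeIterator<T> implements Iterator<T> {
    // One supplier per partition; calling get() stands in for fetching that partition.
    private final Iterator<Supplier<List<T>>> partitions;
    // Iterator over the partition currently held in memory.
    private Iterator<T> current = Collections.emptyIterator();
    // Number of partitions fetched so far (assumed counter, for illustration).
    private int fetches = 0;

    PartitionAtATimeIterator(List<Supplier<List<T>>> partitions) {
        this.partitions = partitions.iterator();
    }

    @Override
    public boolean hasNext() {
        // When the current partition is exhausted, fetch the next one.
        // This is the expensive step, and it happens inside hasNext.
        while (!current.hasNext() && partitions.hasNext()) {
            current = partitions.next().get().iterator();
            fetches++;
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    int fetches() {
        return fetches;
    }
}
```

In this model only the first hasNext call after a partition boundary is slow; calls made in the middle of a partition return immediately. If the real behaviour matches, the 20 seconds would be the cost of computing and shipping each partition rather than overhead of the iterator itself.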