Sean Owen created SPARK-3369:
--------------------------------

             Summary: Java mapPartitions Iterator->Iterable is inconsistent 
with Scala's Iterator->Iterator
                 Key: SPARK-3369
                 URL: https://issues.apache.org/jira/browse/SPARK-3369
             Project: Spark
          Issue Type: Improvement
          Components: Java API
    Affects Versions: 1.0.2
            Reporter: Sean Owen


{{mapPartitions}} in the Scala RDD API takes a function that transforms an 
{{Iterator}} to an {{Iterator}}: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an 
{{Iterator}} but is requires to return an {{Iterable}}, which is a stronger 
condition and appears inconsistent. It's a problematic inconsistent though 
because this seems to require copying all of the input into memory in order to 
create an object that can be iterated many times, since the input does not 
afford this itself.

Similarity for other {{mapPartitions*}} methods and other 
{{*FlatMapFunctions}}s in Java.

(Is there a reason for this difference that I'm overlooking?)

If I'm right that this was inadvertent inconsistency, then the big issue here 
is that of course this is part of a public API. Workarounds I can think of:

Promise that Spark will only call {{iterator()}} once, so implementors can use 
a hacky {{IteratorIterable}} that returns the same {{Iterator}}.

Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
desired signature, and deprecate existing ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to