Re: Where is reduceByKey?

Philip Ogren Thu, 07 Nov 2013 15:47:56 -0800

Thanks - I think this would be a helpful note to add to the docs. I went and read a few things about Scala implicit conversions (I'm obviously new to the language) and it seems like a very powerful language feature and now that I know about them it will certainly be easy to identify when they are missing (i.e. the first thing to suspect when you see a "not a member" compilation message.) I'm still a bit mystified as to how you would go about finding the appropriate imports except that I suppose you aren't very likely to use methods that you don't already know about! Unless you are copying code verbatim that doesn't have the necessary import statements....


On 11/7/2013 4:05 PM, Matei Zaharia wrote:

Yeah, this is confusing and unfortunately as far as I know it’s API specific. Maybe we should add this to the documentation page for RDD.
The reason for these conversions is to only allow some operations based on the underlying data type of the collection. For example, Scala collections support sum() as long as they contain numeric types. That’s fine for the Scala collection library since its conversions are imported by default, but I guess it makes it confusing for third-party apps.
Matei
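
As a rough illustration of the sum() point above, here is a minimal sketch in plain Scala (no Spark involved). The object name NumericOnlySketch and the helper total are hypothetical, made up for this example. In the standard library, sum takes an implicit Numeric[T] parameter whose instances live in the Numeric companion object, which is why no import is needed there; the sketch mimics that shape.

```scala
object NumericOnlySketch {
  // Hypothetical helper: compiles only when a Numeric[T] instance
  // can be found in implicit scope for the element type T.
  def total[T](xs: Seq[T])(implicit num: Numeric[T]): T =
    xs.foldLeft(num.zero)(num.plus)

  val ints: Int = total(Seq(1, 2, 3))
  val doubles: Double = total(Seq(1.5, 2.5))

  // The standard library's own sum is restricted in the same way.
  val builtIn: Int = List(1, 2, 3).sum

  // total(Seq("a", "b"))  // would not compile: there is no Numeric[String]
}
```

The operation exists only for element types that qualify, and the compiler reports the mismatch at compile time rather than at run time.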
On Nov 7, 2013, at 1:15 PM, Philip Ogren wrote:
I remember running into something very similar when trying to perform a foreach on java.util.List and I fixed it by adding the following import:
import scala.collection.JavaConversions._
And my foreach loop magically compiled - presumably due to another implicit conversion. Now this is the second time I've run into this problem and I didn't recognize it. I'm not sure that I would know what to do the next time I run into this. Do you have some advice on how I should have recognized a missing import that provides implicit conversions and how I would know what to import? This strikes me as code obfuscation. I guess this is more of a Scala question....
Thanks,
Philip
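
For the java.util.List case, one way to make the conversion visible at the call site is scala.collection.JavaConverters, which adds an explicit asScala method instead of converting silently. The sketch below is only an illustration under that assumption: the object JavaListSketch and the helper upperCased are hypothetical names, and newer Scala versions (2.13 and later) offer scala.jdk.CollectionConverters for the same purpose.

```scala
import java.util.{ArrayList => JArrayList, List => JList}
import scala.collection.JavaConverters._

object JavaListSketch {
  // Hypothetical helper: asScala wraps the Java list so that Scala
  // collection methods (foreach, map, ...) become available on it.
  def upperCased(words: JList[String]): List[String] =
    words.asScala.map(_.toUpperCase).toList

  val sample: JList[String] = new JArrayList[String]()
  sample.add("spark")
  sample.add("scala")

  val result: List[String] = upperCased(sample)
}
```

Because the conversion is spelled out as asScala, a missing import shows up as "value asScala is not a member of java.util.List", which points more directly at the cause.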



On 11/7/2013 2:01 PM, Josh Rosen wrote:
The additional methods on RDDs of pairs are defined in a class called PairRDDFunctions (https://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions). SparkContext provides an implicit conversion from RDD[(K, V)] to PairRDDFunctions[K, V] to make this transparent to users.
To import those implicit conversions, use

    import org.apache.spark.SparkContext._
These conversions are automatically imported by Spark Shell, but you'll have to import them yourself in standalone programs.
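
To show why the import matters, here is a small self-contained model of the pattern in plain Scala. It is not Spark code: MiniRDD, MiniPairFunctions, MiniContext and WordCountSketch are hypothetical stand-ins for RDD, PairRDDFunctions and the SparkContext object, and the sketch assumes Scala 2.10 or later for the scala.language import.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for RDD[T]: a thin wrapper over a Seq.
class MiniRDD[T](val items: Seq[T]) {
  def map[U](f: T => U): MiniRDD[U] = new MiniRDD(items.map(f))
}

// Hypothetical stand-in for PairRDDFunctions: methods that only
// make sense when the elements are key/value pairs.
class MiniPairFunctions[K, V](self: MiniRDD[(K, V)]) {
  def reduceByKey(f: (V, V) => V): Map[K, V] =
    self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
}

// Hypothetical stand-in for the SparkContext object holding the conversion.
object MiniContext {
  implicit def toPairFunctions[K, V](rdd: MiniRDD[(K, V)]): MiniPairFunctions[K, V] =
    new MiniPairFunctions(rdd)
}

object WordCountSketch {
  // Without this import the call below fails with
  // "value reduceByKey is not a member of MiniRDD[(String, Int)]".
  import MiniContext._

  def wordCount(lines: Seq[String]): Map[String, Int] =
    new MiniRDD(lines.flatMap(_.split(" ")))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
}
```

The method is absent from MiniRDD itself, so it would not show up in that class's API docs either; it becomes callable only once the conversion is in scope.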
On Thu, Nov 7, 2013 at 11:54 AM, Philip Ogren wrote:
    On the front page <http://spark.incubator.apache.org/> of the
    Spark website there is the following simple word count
    implementation:

    file = spark.textFile("hdfs://...")
    file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    The same code can be found in the Quick Start
    <http://spark.incubator.apache.org/docs/latest/quick-start.html>
    guide.  When I follow the steps in my spark-shell (version
    0.8.0) it works fine.  The reduceByKey method is also shown in
    the list of transformations
    <http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#transformations>
    in the Spark Programming Guide.  The bottom of this list directs
    the reader to the API docs for the class RDD (this link is
    broken, BTW). The API docs for RDD
    <http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD>
    do not list a reduceByKey method for RDD.  Also, when I try to
    compile the above code in a Scala class definition I get the
    following compile error:

    value reduceByKey is not a member of
    org.apache.spark.rdd.RDD[(java.lang.String, Int)]

    I am compiling with maven using the following dependency definition:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.9.3</artifactId>
            <version>0.8.0-incubating</version>
        </dependency>

    Can someone help me understand why this code works fine from the
    spark-shell but doesn't seem to exist in the API docs and won't
    compile?

    Thanks,
    Philip
