Hi Old-School,

For the first question, you can specify the number of partitions of any DataFrame by using repartition(numPartitions: Int, partitionExprs: Column*). Example:

    val partitioned = data.repartition(numPartitions = 10).cache()

For your second question, you can drop down to the underlying RDD, turn it into a pair RDD and use reduceByKey(). Example:

    val pairs = data.rdd
      .map(row => (row.getString(1), row.getString(2).toInt))
      .reduceByKey(_ + _)

Best,

Khwunchai Jaengsawang
Email: khwuncha...@ku.th
LinkedIn <https://linkedin.com/in/khwunchai> | Github <https://github.com/khwunchai>


> On Mar 4, 2560 BE, at 8:59 PM, Old-School <giorgos_myrianth...@outlook.com> wrote:
>
> Hi,
>
> I want to perform some simple transformations and check the execution time
> under various configurations (e.g. number of cores being used, number of
> partitions etc). Since it is not possible to set the partitions of a
> DataFrame, I guess that I should probably use RDDs.
>
> I've got a dataset with 3 columns as shown below:
>
> val data = file.map(line => line.split(" "))
>   .filter(lines => lines.length == 3) // ignore first line
>   .map(row => (row(0), row(1), row(2)))
>   .toDF("ID", "word-ID", "count")
>
> which results in:
>
> +---+-------+-----+
> | ID|word-ID|count|
> +---+-------+-----+
> | 15|     87|  151|
> | 20|     19|  398|
> | 15|     19|   21|
> |180|     90|  190|
> +---+-------+-----+
>
> So how can I turn the above into an RDD in order to use e.g.
> sc.parallelize(data, 10) and set the number of partitions to say 10?
>
> Furthermore, I would also like to ask about the equivalent expression (using
> the RDD API) for the following simple transformation:
>
> data.select("word-ID", "count")
>   .groupBy("word-ID")
>   .agg(sum($"count").as("count"))
>   .show()
>
> Thanks in advance
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
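
P.S. For completeness, a minimal end-to-end sketch tying both answers to the schema in the quoted message. It assumes a DataFrame named data with the three string columns ID, word-ID and count, built exactly as above; the value names (rows, counts) and the partition count of 10 are only illustrative.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Question 1: Dataset.rdd exposes the underlying RDD[Row]; repartition
    // (or coalesce) then sets the number of partitions explicitly, so there
    // is no need to go through sc.parallelize.
    val rows: RDD[Row] = data.rdd.repartition(10)
    println(rows.getNumPartitions)   // 10

    // Question 2: RDD equivalent of
    //   data.groupBy("word-ID").agg(sum($"count").as("count"))
    // Build (word-ID, count) pairs and sum the counts per key.
    val counts: RDD[(String, Int)] = rows
      .map(row => (row.getString(1), row.getString(2).toInt))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)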