My background is mainly in Java and I am a newbie when it comes to Spark. I have a file where each line has the form ccode|custId|orderId. I would like to solve the following bin packing problem and balance the load across partitions based on custId.
My steps are:

1) Read the file.
   // JavaRDD<String> input = ctx.textFile(args[1], 1);
2) Key each line by custId, i.e. build (custId, ccode|custId|orderStuff) pairs.
   // JavaPairRDD<String, String> custIdToLine = input.mapToPair(new PairFunction<String, String, String>() { ... });
3) Count the occurrences of each custId. Should this be a reduce job? Or can I use groupByKey on "custIdToLine" and count the size of each group inside mapValues()? What is the difference between the two? (Rough sketch below.)
4) Sort by count, descending.
5) Put the k largest custIds into their own sets (k is the number of partitions).
6) Add each remaining custId to the set with the smallest sum. (A sketch of 5) and 6) is below as well.)

Is there any further documentation or examples anyone can point me to, apart from the API docs and the default Spark examples?

Thanks in advance
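To make step 3) concrete, here is a rough, untested sketch of the reduceByKey version I have in mind (class name, field index and variable names are just placeholders I made up):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CustIdCounts {
        public static void main(String[] args) {
            JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("CustIdCounts"));

            // 1) Read the file; each line is "ccode|custId|orderId".
            JavaRDD<String> input = ctx.textFile(args[1], 1);

            // 2)+3) Map each line to (custId, 1) and sum the ones per custId.
            //       reduceByKey combines the partial counts on each partition before the
            //       shuffle; groupByKey + mapValues(size of the group) first moves every
            //       value across the network and only then counts, so it does the same
            //       job with more traffic. countByKey() on a (custId, line) pair RDD is
            //       another option, but it returns the whole map to the driver.
            JavaPairRDD<String, Integer> counts = input
                    .mapToPair(line -> new Tuple2<>(line.split("\\|")[1], 1))
                    .reduceByKey(Integer::sum);

            // 4) Sort by count, descending: swap to (count, custId), sortByKey(false), swap back.
            JavaPairRDD<String, Integer> sorted = counts
                    .mapToPair(Tuple2::swap)
                    .sortByKey(false)
                    .mapToPair(Tuple2::swap);

            sorted.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
            ctx.stop();
        }
    }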
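And for steps 5) and 6), this is the driver-side greedy assignment I am thinking of (again only a sketch, assuming the sorted (custId, count) list fits in driver memory):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;
    import scala.Tuple2;

    public class GreedyAssignment {
        /**
         * Assign custIds (already sorted by count, descending) to k bins, always
         * giving the next custId to the bin with the smallest running sum.
         * The first k custIds naturally land in the k empty bins, which covers
         * step 5); the rest follow the smallest-sum rule, which is step 6).
         */
        public static Map<String, Integer> assign(List<Tuple2<String, Integer>> sortedCounts, int k) {
            // Min-heap of {runningSum, binIndex}, ordered by runningSum.
            PriorityQueue<long[]> bins = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
            for (int i = 0; i < k; i++) {
                bins.add(new long[]{0L, i});
            }

            Map<String, Integer> custIdToBin = new HashMap<>();
            for (Tuple2<String, Integer> entry : sortedCounts) {
                long[] lightest = bins.poll();            // bin with the smallest sum so far
                custIdToBin.put(entry._1(), (int) lightest[1]);
                lightest[0] += entry._2();                // add this custId's count to the bin
                bins.add(lightest);
            }
            return custIdToBin;
        }
    }

My idea is that the returned custId -> bin map could then be used inside getPartition() of a custom org.apache.spark.Partitioner and passed to partitionBy() on the (custId, line) pair RDD, so each custId's lines end up in its assigned partition. Does that sound like a reasonable approach?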