My background is mainly in Java and I am a newbie when it comes to Spark. I have a file where each line has the form ccode|custId|orderId. I would like to solve the following bin packing problem and balance the load across partitions based on custId.
My steps are:

1) Read the file.
   // JavaRDD<String> input = ctx.textFile(args[1], 1);
2) Key each line by custId, i.e. build (custId, ccode|custId|orderStuff) pairs.
   // JavaPairRDD<String, String> custIdToLine = input.mapToPair(new PairFunction<String, String, String>() { ... });
3) Count the occurrences of each custId. Should this be a reduce job? Or can I use groupByKey on "custIdToLine" and count the size of each group inside mapValues()? What is the difference between the two? (Rough sketch below.)
4) Sort by count, descending.
5) Put the k largest custIds into their own sets (k is the number of partitions).
6) Add each remaining custId to the set with the smallest sum. (A sketch of 5) and 6) is below as well.)

Is there any further documentation or examples anyone can point me to, apart from the API docs and the default Spark examples?

Thanks in advance
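To make step 3) concrete, here is a rough, untested sketch of the reduceByKey version I have in mind (class name, field index and variable names are just placeholders I made up):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CustIdCounts {
        public static void main(String[] args) {
            JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("CustIdCounts"));

            // 1) Read the file; each line is "ccode|custId|orderId".
            JavaRDD<String> input = ctx.textFile(args[1], 1);

            // 2)+3) Map each line to (custId, 1) and sum the ones per custId.
            //       reduceByKey combines the partial counts on each partition before the
            //       shuffle; groupByKey + mapValues(size of the group) first moves every
            //       value across the network and only then counts, so it does the same
            //       job with more traffic. countByKey() on a (custId, line) pair RDD is
            //       another option, but it returns the whole map to the driver.
            JavaPairRDD<String, Integer> counts = input
                    .mapToPair(line -> new Tuple2<>(line.split("\\|")[1], 1))
                    .reduceByKey(Integer::sum);

            // 4) Sort by count, descending: swap to (count, custId), sortByKey(false), swap back.
            JavaPairRDD<String, Integer> sorted = counts
                    .mapToPair(Tuple2::swap)
                    .sortByKey(false)
                    .mapToPair(Tuple2::swap);

            sorted.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
            ctx.stop();
        }
    }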
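And for steps 5) and 6), this is the driver-side greedy assignment I am thinking of (again only a sketch, assuming the sorted (custId, count) list fits in driver memory):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;
    import scala.Tuple2;

    public class GreedyAssignment {
        /**
         * Assign custIds (already sorted by count, descending) to k bins, always
         * giving the next custId to the bin with the smallest running sum.
         * The first k custIds naturally land in the k empty bins, which covers
         * step 5); the rest follow the smallest-sum rule, which is step 6).
         */
        public static Map<String, Integer> assign(List<Tuple2<String, Integer>> sortedCounts, int k) {
            // Min-heap of {runningSum, binIndex}, ordered by runningSum.
            PriorityQueue<long[]> bins = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
            for (int i = 0; i < k; i++) {
                bins.add(new long[]{0L, i});
            }

            Map<String, Integer> custIdToBin = new HashMap<>();
            for (Tuple2<String, Integer> entry : sortedCounts) {
                long[] lightest = bins.poll();            // bin with the smallest sum so far
                custIdToBin.put(entry._1(), (int) lightest[1]);
                lightest[0] += entry._2();                // add this custId's count to the bin
                bins.add(lightest);
            }
            return custIdToBin;
        }
    }

My idea is that the returned custId -> bin map could then be used inside getPartition() of a custom org.apache.spark.Partitioner and passed to partitionBy() on the (custId, line) pair RDD, so each custId's lines end up in its assigned partition. Does that sound like a reasonable approach?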