Hello guys! First of all, if you want a more readable version of this question, take a look at my Stack Overflow post <http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object> (I've asked the same question there).
I want to test Spark's machine learning algorithms, and I have some questions about running them with non-native data types. I'm going to run the FPGrowth algorithm over my input because I want to get the most frequent itemsets for it.

*My data is laid out as follows:*

    [timestamp, sensor1value, sensor2value] # id: 0
    [timestamp, sensor1value, sensor2value] # id: 1
    [timestamp, sensor1value, sensor2value] # id: 2
    [timestamp, sensor1value, sensor2value] # id: 3
    ...

As I need to use Java (Python doesn't expose many of Spark's machine learning algorithms), this data structure isn't very easy to handle or create.

*To build this data structure in Java I can see two approaches:*

1. Use existing Java classes and data types to structure the input (I think some problems can occur in Spark depending on how complex my data is).
2. Create my own class (I don't know whether that works with Spark's algorithms).

1. Existing Java classes and data types

To do that I've created a *List<Tuple2<Long, List<Double>>>*, so I can keep my data structured and also create an RDD:

    List<Tuple2<Long, List<Double>>> algorithm_data =
        new ArrayList<Tuple2<Long, List<Double>>>();
    populate(algorithm_data);
    JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);

I don't feel comfortable with JavaPairRDD because the FPGrowth algorithm doesn't seem to accept this data structure, as I'll show later in this post.

2. Create my own class

I could also create a new class to store the input properly:

    public class PointValue {
        private long timestamp;
        private double sensorMeasure1;
        private double sensorMeasure2;
        // Constructor, getters and setters omitted...
    }

However, I don't know whether I can do that and still use the class with Spark's algorithms without any problems (in other words, run Spark algorithms without headaches). I'll focus on the first approach, but if you think the second one is easier to achieve, please tell me.

My attempt (based on approach #1):

    // Initializing Spark
    SparkConf conf = new SparkConf().setAppName("FP-growth Example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Getting data for the ML algorithm
    List<Tuple2<Long, List<Double>>> algorithm_data =
        new ArrayList<Tuple2<Long, List<Double>>>();
    populate(algorithm_data);
    JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);

    // Running FPGrowth
    FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
    FPGrowthModel<Tuple2<Long, List<Double>>> model = fpg.run(transactions);

    // Printing everything
    for (FPGrowth.FreqItemset<Tuple2<Long, List<Double>>> itemset :
            model.freqItemsets().toJavaRDD().collect()) {
        System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
    }

But then I got a compile error:

*The method run(JavaRDD<Basket>) in the type FPGrowth is not applicable for the arguments (JavaPairRDD<Long, List<Double>>)*

*What can I do to solve my problem (run FPGrowth over a JavaPairRDD)?* I'm happy to give you more information; just tell me exactly what you need.

Thank you!
Fernando Paladini
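P.S.: After writing this, I sketched a possible workaround based on the FPGrowth example in the Spark MLlib docs, which feeds run() a JavaRDD<List<String>> (i.e., an RDD whose elements are Iterable collections of items). The idea is to map my JavaPairRDD down to that shape first. This is untested, and the "s1=" / "s2=" item encoding is just my own assumption about how to turn sensor readings into distinct items, not something from the docs:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.mllib.fpm.FPGrowth;
    import org.apache.spark.mllib.fpm.FPGrowthModel;

    import scala.Tuple2;

    // Drop the timestamp key and turn each row into a "transaction" of
    // string items, which is the shape the MLlib FPGrowth example uses.
    JavaRDD<List<String>> baskets = transactions.map(
        new Function<Tuple2<Long, List<Double>>, List<String>>() {
            @Override
            public List<String> call(Tuple2<Long, List<Double>> row) {
                List<Double> sensors = row._2();
                // The "s1=" / "s2=" prefixes are my own guess, just to keep
                // the two sensor readings distinct within one transaction.
                return Arrays.asList(
                        "s1=" + sensors.get(0),
                        "s2=" + sensors.get(1));
            }
        });

    FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
    FPGrowthModel<String> model = fpg.run(baskets);

    for (FPGrowth.FreqItemset<String> itemset :
            model.freqItemsets().toJavaRDD().collect()) {
        System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
    }

If mapping to string items like this is the wrong way to feed sensor data into FPGrowth (for example, if the raw doubles should be discretized first), I'd appreciate pointers on that too.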