Hello guys!

First of all, if you want a more readable version of this question, take
a look at my StackOverflow question
<http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object>
(I've asked the same question there).

I want to test Spark's machine learning algorithms, and I have some questions
about how to run them with non-native data types. I'm going to run the
FPGrowth algorithm over my input because I want to get the most frequent
itemsets for it.

*My data is laid out as follows:*

[timestamp, sensor1value, sensor2value] # id: 0
[timestamp, sensor1value, sensor2value] # id: 1
[timestamp, sensor1value, sensor2value] # id: 2
[timestamp, sensor1value, sensor2value] # id: 3
...

As I need to use Java (the Python API doesn't expose many of Spark's machine
learning algorithms), this data structure isn't very easy to handle or
create.

*To build this data structure in Java, I can see two approaches:*

   1. Use existing Java classes and data types to structure the input (I
   think some problems can occur in Spark depending on how complex my data is).
   2. Create my own class (I don't know if it works with Spark algorithms).

1. Existing Java classes and data types

In order to do that I've created a *List<Tuple2<Long, List<Double>>>*, so I
can keep my data structured and also create an RDD:

List<Tuple2<Long, List<Double>>> algorithm_data = new ArrayList<Tuple2<Long, List<Double>>>();
populate(algorithm_data);
JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);
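
In case it matters, populate() isn't shown here; a minimal hypothetical
version (assuming the Long key is the row id and the list holds [timestamp,
sensor1value, sensor2value], as in the layout above) would be:

import java.util.Arrays;
import java.util.List;
import scala.Tuple2;

// Hypothetical stand-in for populate(): one (id, [timestamp, sensor1, sensor2])
// entry per reading; my real code fills this from the sensor log.
private static void populate(List<Tuple2<Long, List<Double>>> data) {
    data.add(new Tuple2<Long, List<Double>>(0L, Arrays.asList(1446000000.0, 0.5, 1.2)));
    data.add(new Tuple2<Long, List<Double>>(1L, Arrays.asList(1446000060.0, 0.7, 1.1)));
    data.add(new Tuple2<Long, List<Double>>(2L, Arrays.asList(1446000120.0, 0.5, 1.2)));
}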

I'm not comfortable with JavaPairRDD, because the FPGrowth algorithm doesn't
seem to accept this data structure, as I'll show later in this post.
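
For comparison, the only input shape I've seen FPGrowth accept (this is
adapted from the Spark MLlib documentation example; the baskets below are
made up by me) is a plain JavaRDD of iterable baskets:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

// FPGrowth.run() takes a JavaRDD whose elements are Iterable baskets of items
// (e.g. JavaRDD<List<String>>), not a JavaPairRDD; sc is the JavaSparkContext.
JavaRDD<List<String>> docExample = sc.parallelize(Arrays.asList(
        Arrays.asList("a", "b", "c"),
        Arrays.asList("a", "b"),
        Arrays.asList("b", "c")));
FPGrowthModel<String> docModel = new FPGrowth()
        .setMinSupport(0.2)
        .setNumPartitions(10)
        .run(docExample);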

2. Create my own class

I could also create a new class to store the input properly:

public class PointValue {

    private long timestamp;
    private double sensorMeasure1;
    private double sensorMeasure2;

    // Constructor, getters and setters omitted...
}
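
One thing I'm fairly confident about is that anything stored in an RDD has
to be serializable so Spark can ship it between nodes, so at minimum I'd
declare it like this (whether FPGrowth would then accept it is exactly what
I don't know):

import java.io.Serializable;

// Minimal sketch of approach #2: the class implements Serializable so Spark
// can move instances across the cluster; getters omitted as above.
public class PointValue implements Serializable {

    private final long timestamp;
    private final double sensorMeasure1;
    private final double sensorMeasure2;

    public PointValue(long timestamp, double sensorMeasure1, double sensorMeasure2) {
        this.timestamp = timestamp;
        this.sensorMeasure1 = sensorMeasure1;
        this.sensorMeasure2 = sensorMeasure2;
    }
}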

However, I don't know if I can do that and still use it with Spark's
algorithms without any problems (in other words, run them without
headaches). I'll focus on the first approach, but if you think the second
one is easier to achieve, please tell me.
My attempted solution (based on approach #1):

// Initializing Spark
SparkConf conf = new SparkConf().setAppName("FP-growth Example");
JavaSparkContext sc = new JavaSparkContext(conf);

// Getting data for ML algorithm
List<Tuple2<Long, List<Double>>> algorithm_data = new ArrayList<Tuple2<Long, List<Double>>>();
populate(algorithm_data);
JavaPairRDD<Long, List<Double>> transactions = sc.parallelizePairs(algorithm_data);

// Running FPGrowth
FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
FPGrowthModel<Tuple2<Long, List<Double>>> model = fpg.run(transactions);

// Printing everything
for (FPGrowth.FreqItemset<Tuple2<Long, List<Double>>> itemset : model.freqItemsets().toJavaRDD().collect()) {
    System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
}

But then I got:

*The method run(JavaRDD<Basket>) in the type FPGrowth is not
applicable for the arguments (JavaPairRDD<Long,List<Double>>)*

*What can I do to solve my problem (run FPGrowth over a JavaPairRDD)?*
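
To be clear about what I mean: my current (untested) guess is that I'd have
to flatten each pair into a plain basket of items before calling run(),
something like the sketch below, where the string encoding of the sensor
values is entirely made up by me:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.fpm.FPGrowthModel;

// Guess: map each (id, values) pair to a basket of string items, since
// FPGrowth seems to treat items as opaque objects compared by equality.
JavaRDD<List<String>> baskets = transactions.map(pair -> {
    List<String> items = new ArrayList<String>();
    for (Double value : pair._2()) {
        items.add(String.valueOf(value));
    }
    return items;
});
FPGrowthModel<String> basketModel = fpg.run(baskets);

Is that the right direction, or is there a way to make FPGrowth consume the
JavaPairRDD directly?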

I'm happy to give you more information; just tell me exactly what you need.
Thank you!
Fernando Paladini
