Hi,
I’m quite new to Spark and MR, but have a requirement to get all distinct
values with their respective counts from a transactional file. Let’s assume the
following file format:
0 1 2 3 4 5 6 7
1 3 4 5 8 9
9 10 11 12 13 14 15 16 17 18
1 4 7 11 12 13 19 20
3 4 7 11 15 20 21 22 23
1 2 5 9 11 12 16
Given this, I would like a Map<String, Integer> back (or a list of
(String, Integer) pairs), where the String is the item identifier and the
Integer is the number of times that identifier occurs in the file. The
following is what I came up with to map the values, but I can’t figure out how
to do the counting :(
// create an RDD where each element is one line's items as an ArrayList of strings
JavaRDD<ArrayList<String>> transactions = sc.textFile(dataPath).map(
    new Function<String, ArrayList<String>>() {
        private static final long serialVersionUID = 1L;

        @Override
        public ArrayList<String> call(String s) {
            return Lists.newArrayList(s.split(" "));
        }
    }
);
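To make the desired result concrete, here is a minimal plain-Java sketch of the counting on the sample file above, without Spark. The class name ItemCounts and the method countItems are made up for illustration. In Spark the same shape would presumably be a flatMap over the tokens of each line, followed by mapToPair to (item, 1) and reduceByKey to sum the ones, but that part is an assumption and not something tested here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only (not part of Spark).
public class ItemCounts {

    // Counts how often each whitespace-separated item identifier
    // occurs across all lines.
    public static Map<String, Integer> countItems(List<String> lines) {
        // TreeMap keeps the keys in lexicographic String order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String item : line.trim().split("\\s+")) {
                if (!item.isEmpty()) {
                    // Start at 1 for a new item, otherwise add 1 to its count.
                    counts.merge(item, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "0 1 2 3 4 5 6 7",
            "1 3 4 5 8 9",
            "9 10 11 12 13 14 15 16 17 18",
            "1 4 7 11 12 13 19 20",
            "3 4 7 11 15 20 21 22 23",
            "1 2 5 9 11 12 16"
        );
        // Print one "item count" pair per line.
        countItems(lines).forEach(
            (item, count) -> System.out.println(item + " " + count));
    }
}
```

On the sample file this gives 24 distinct items; for example, item 1 occurs 4 times and item 9 occurs 3 times.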
Any ideas?
Thanks!
Patrick