Hi, we would appreciate insights and wisdom on a problem we are working on:
1. Context:

   Given a CSV file like:

       d1,c1,a1
       d1,c1,a2
       d1,c2,a1
       d1,c1,a1
       d2,c1,a3
       d2,c2,a1
       d3,c1,a1
       d3,c3,a1
       d3,c2,a1
       d3,c3,a2
       d5,c1,a3
       d5,c2,a2
       d5,c3,a2

   we want to find the totals and unique counts of the d_ values across the c_ and a_ dimensions:

               Tot   Unique
       c1       6      4
       c2       4      4
       c3       3      2
       a1       7      3
       a2       4      3
       a3       2      2
       c1-a1   ...
       c1-a2   ...
       c1-a3   ...
       c2-a1   ...
       c2-a2   ...
       ...
       c3-a3   ...

   Obviously there are millions of records and more attributes/dimensions, so scalability is key.

2. We think Spark is a good stack for this problem, but we have a few questions:

3. From a Spark substrate perspective, what are the optimal transformations, and what should we watch out for?

4. Is a pair RDD the best data representation? groupByKey et al. are only available on pair RDDs.

5. On a pragmatic level, file.map().map() results in a plain RDD. How do we transform it into a pair RDD? (Minimal sketches of what we are attempting follow after the sign-off.)
   1. .map(fields => (fields(1), fields(0))) results in Unit.
   2. .map(fields => fields(1) -> fields(0)) also does not work.
   3. Neither of these gives us a pair RDD.
   4. We must be missing something fundamental.

Cheers & have a nice weekend

<k/>
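P.S. For concreteness, here is a minimal sketch of what we are attempting (question 5), assuming the Spark Scala shell with sc in scope; "data.csv" and the variable names are placeholders, not our real ones:

    import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions
                                            // (needed on Spark versions before 1.3)

    // Each line becomes Array(d, c, a); "data.csv" stands in for the real path.
    val rows = sc.textFile("data.csv").map(_.split(","))

    // Keying by the c_ column yields an RDD[(String, String)]. A "pair RDD" is just
    // an RDD of Tuple2; groupByKey, reduceByKey, etc. become available on it through
    // the implicit conversion imported above.
    val byC = rows.map(fields => (fields(1), fields(0)))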
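And here is a sketch of the aggregation we have in mind, building on the rows and byC values above; we lean on reduceByKey rather than groupByKey on the assumption that combining partial counts on each partition before the shuffle is the scalable choice:

    // Key every row by each single dimension and by the combined c-a dimension.
    val byA  = rows.map(fields => (fields(2), fields(0)))
    val byCA = rows.map(fields => (fields(1) + "-" + fields(2), fields(0)))
    val keyed = byC union byA union byCA

    // Tot: count every (key, d) occurrence.
    val totals  = keyed.mapValues(_ => 1L).reduceByKey(_ + _)

    // Unique: drop duplicate (key, d) pairs first, then count.
    val uniques = keyed.distinct().mapValues(_ => 1L).reduceByKey(_ + _)

    // One line per key: key, total, unique.
    totals.join(uniques).collect().foreach {
      case (key, (tot, uniq)) => println(s"$key\t$tot\t$uniq")
    }

Is this roughly the shape of an idiomatic solution, or is there a better substrate-level approach?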