And got the first cut:
val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
This gives the total and unique counts per key.
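To make concrete what the groupByKey line computes, here is a minimal sketch in plain Scala collections (no Spark), run on the sample rows from the original question. The object and method names are hypothetical, and the composite "c-a" string key is only one assumed way to produce the c1-a1 style rows.

```scala
object TotalsAndUniquesLocal {
  // Sample rows from the original question, as (d, c, a).
  val rows: Seq[(String, String, String)] = Seq(
    ("d1", "c1", "a1"), ("d1", "c1", "a2"), ("d1", "c2", "a1"), ("d1", "c1", "a1"),
    ("d2", "c1", "a3"), ("d2", "c2", "a1"),
    ("d3", "c1", "a1"), ("d3", "c3", "a1"), ("d3", "c2", "a1"), ("d3", "c3", "a2"),
    ("d5", "c1", "a3"), ("d5", "c2", "a2"), ("d5", "c3", "a2"))

  // Local analogue of groupByKey().map(...): key -> (total, unique).
  def totalsAndUniques(pairs: Seq[(String, String)]): Map[String, (Int, Int)] =
    pairs.groupBy(_._1).map { case (key, kvs) =>
      val ds = kvs.map(_._2)          // all d values seen for this key
      (key, (ds.size, ds.toSet.size)) // total rows, distinct d values
    }

  def main(args: Array[String]): Unit = {
    // One keying per dimension; the composite key covers the c-a combinations.
    val byC  = totalsAndUniques(rows.map { case (d, c, _) => (c, d) })
    val byA  = totalsAndUniques(rows.map { case (d, _, a) => (a, d) })
    val byCA = totalsAndUniques(rows.map { case (d, c, a) => (s"$c-$a", d) })
    (byC ++ byA ++ byCA).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Each dimension is just a different choice of key over the same rows, so adding more attributes means adding more keyings, not changing the counting step.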
The question: is it scalable and efficient? Would appreciate insights.
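On the scalability point, one possible alternative, offered as a sketch and not a measured recommendation: groupByKey gathers every value for a key into a single in-memory collection, whereas reduceByKey combines partial counts within each partition before the shuffle. The version below computes the same (total, unique) pair per key that way. The object name, app name, local master and inline sample rows are assumptions for illustration. The SparkContext._ import is what supplies the pair-RDD implicits on Spark 1.x before 1.3, so the explicit PairRDDFunctions wrapper should not be needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x before 1.3
import org.apache.spark.rdd.RDD

object TotalsAndUniquesSpark {
  // key -> (total rows, distinct values), without holding a key's values in one collection.
  def totalsAndUniques(pairs: RDD[(String, String)]): RDD[(String, (Long, Long))] = {
    // Total: count one per row, summed per key.
    val totals = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
    // Unique: drop duplicate (key, value) pairs first, then count per key.
    val uniques = pairs.distinct().mapValues(_ => 1L).reduceByKey(_ + _)
    totals.join(uniques)
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for trying the sketch on a laptop.
    val conf = new SparkConf().setAppName("totals-and-uniques").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq(
        "d1,c1,a1", "d1,c1,a2", "d1,c2,a1", "d1,c1,a1", "d2,c1,a3", "d2,c2,a1"))
      // Key by the c_ column, value is the d_ column.
      val pairs = lines.map(_.split(",")).map(f => (f(1), f(0)))
      totalsAndUniques(pairs).collect().sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Since pairs is traversed twice here, caching it with pairs.cache() may help if the input is expensive to recompute. The distinct() step still shuffles every unique (key, value) pair, so this trades one large per-key collection for an extra shuffle.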
Cheers
<k/>
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar <[email protected]>
wrote:
> Answered one of my questions (#5) : val pairs = new
> PairRDDFunctions(<RDD>) works fine locally. Now I can do groupByKey et al.
> I am not sure whether it scales to millions of records or is memory efficient.
> Cheers
> <k/>
>
>
> On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar <[email protected]>
> wrote:
>
>> Hi,
>> Would appreciate insights and wisdom on a problem we are working on:
>>
>> 1. Context:
>> - Given a csv file like:
>> - d1,c1,a1
>> - d1,c1,a2
>> - d1,c2,a1
>> - d1,c1,a1
>> - d2,c1,a3
>> - d2,c2,a1
>> - d3,c1,a1
>> - d3,c3,a1
>> - d3,c2,a1
>> - d3,c3,a2
>> - d5,c1,a3
>> - d5,c2,a2
>> - d5,c3,a2
>> - Want to find the total and unique counts of the d_ values across the
>> c_ and a_ dimensions:
>> - Key  Tot  Unique
>> - c1   6    4
>> - c2   4    4
>> - c3   3    2
>> - a1   7    3
>> - a2   4    3
>> - a3   2    2
>> - c1-a1 ...
>> - c1-a2 ...
>> - c1-a3 ...
>> - c2-a1 ...
>> - c2-a2 ...
>> - ...
>> - c3-a3
>> - Obviously there are millions of records and more
>> attributes/dimensions. So scalability is key
>> 2. We think Spark is a good stack for this problem. We have a few
>> questions:
>> 3. From a Spark substrate perspective, what are some of the optimum
>> transformations & things to watch out for ?
>> 4. Is PairRDD the best data representation? groupByKey et al. are only
>> available on a PairRDD.
>> 5. On a pragmatic level, file.map().map() results in RDD. How do I
>> transform it to a PairRDD ?
>> 1. .map(fields => (fields(1), fields(0))) - results in Unit
>> 2. .map(fields => fields(1) -> fields(0)) also is not working
>> 3. Both these do not result in a PairRDD
>> 4. Am missing something fundamental.
>>
>> Cheers & Have a nice weekend
>> <k/>
>>
>
>