Re: Multi-dimensional Uniques over large dataset

Krishna Sankar Fri, 13 Jun 2014 23:48:08 -0700

And got the first cut:

    val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
gives the total & unique.
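
A minimal sketch of what that expression computes, using plain Scala collections in place of an RDD so it runs without a cluster. The sample rows are the ones from the quoted message below; the names `UniquesSketch`, `rows` and `totalsAndUniques` are illustrative assumptions, not part of the original code.

```scala
object UniquesSketch {
  // Sample rows from the quoted message, as (d, c, a) triples.
  val rows: Seq[(String, String, String)] = Seq(
    ("d1", "c1", "a1"), ("d1", "c1", "a2"), ("d1", "c2", "a1"), ("d1", "c1", "a1"),
    ("d2", "c1", "a3"), ("d2", "c2", "a1"), ("d3", "c1", "a1"), ("d3", "c3", "a1"),
    ("d3", "c2", "a1"), ("d3", "c3", "a2"), ("d5", "c1", "a3"), ("d5", "c2", "a2"),
    ("d5", "c3", "a2")
  )

  // Key by the c_ dimension, keep the d_ id as the value.
  val pairs: Seq[(String, String)] = rows.map { case (d, c, _) => (c, d) }

  // Local equivalent of groupByKey().map(...): key -> (total, unique).
  def totalsAndUniques(kv: Seq[(String, String)]): Map[String, (Int, Int)] =
    kv.groupBy(_._1).map { case (k, group) =>
      val ds = group.map(_._2)
      k -> (ds.size, ds.toSet.size)
    }

  def main(args: Array[String]): Unit =
    totalsAndUniques(pairs).toSeq.sortBy(_._1).foreach(println)
}
```

For the sample rows, `c1` comes out as (6, 4), matching the table in the quoted message.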


The question: is it scalable and efficient? Would appreciate insights.
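
On the scalability point, a commonly noted concern is that `groupByKey` moves every value for a key through the shuffle and holds them together before `size` and `toSet` run. One possible alternative, offered as a sketch and not something proposed in this thread, is to carry a (total, set of ids) accumulator per key. The three functions below have the shape that `combineByKey` on a pair RDD expects; the fold and merge helpers are hypothetical local stand-ins for per-partition and cross-partition work, written with plain Scala collections.

```scala
object CombineSketch {
  // Accumulator per key: running total and the set of distinct d_ ids.
  type Acc = (Int, Set[String])

  // Start an accumulator from the first value seen for a key.
  val createCombiner: String => Acc = d => (1, Set(d))
  // Fold one more value into an existing accumulator.
  val mergeValue: (Acc, String) => Acc = (acc, d) => (acc._1 + 1, acc._2 + d)
  // Merge two accumulators built on different partitions.
  val mergeCombiners: (Acc, Acc) => Acc = (x, y) => (x._1 + y._1, x._2 ++ y._2)

  // Local stand-in for one partition: fold pairs into per-key accumulators.
  def combineLocally(pairs: Seq[(String, String)]): Map[String, Acc] =
    pairs.foldLeft(Map.empty[String, Acc]) { case (m, (k, d)) =>
      m.updated(k, m.get(k).map(mergeValue(_, d)).getOrElse(createCombiner(d)))
    }

  // Merge two per-partition maps, as would happen after the shuffle.
  def mergePartitions(a: Map[String, Acc], b: Map[String, Acc]): Map[String, Acc] =
    b.foldLeft(a) { case (m, (k, acc)) =>
      m.updated(k, m.get(k).map(mergeCombiners(_, acc)).getOrElse(acc))
    }
}
```

The set of ids per key still grows with the number of uniques, so this reduces duplicate values crossing the shuffle but does not bound memory for very high-cardinality keys.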

Cheers

<k/>


On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar <[email protected]>
wrote:

> Answered one of my questions (#5) : val pairs = new
> PairRDDFunctions(<RDD>) works fine locally. Now I can do groupByKey et al.
> Am not sure if it is scalable for millions of records & memory efficient.
> Cheers
> <k/>
>
>
> On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar <[email protected]>
> wrote:
>
>> Hi,
>>    Would appreciate insights and wisdom on a problem we are working on:
>>
>>    1. Context:
>>       - Given a csv file like:
>>       - d1,c1,a1
>>       - d1,c1,a2
>>       - d1,c2,a1
>>       - d1,c1,a1
>>       - d2,c1,a3
>>       - d2,c2,a1
>>       - d3,c1,a1
>>       - d3,c3,a1
>>       - d3,c2,a1
>>       - d3,c3,a2
>>       - d5,c1,a3
>>       - d5,c2,a2
>>       - d5,c3,a2
>>       - Want to find uniques and totals (of the d_ across the c_ and a_
>>       dimensions):
>>       -         Tot   Unique
>>          - c1      6      4
>>          - c2      4      4
>>          - c3      3      2
>>          - a1      7      3
>>          - a2      4      3
>>          - a3      2      2
>>          - c1-a1  ...
>>          - c1-a2 ...
>>          - c1-a3 ...
>>          - c2-a1 ...
>>          - c2-a2 ...
>>          - ...
>>          - c3-a3
>>       - Obviously there are millions of records and more
>>       attributes/dimensions. So scalability is key
>>    2. We think Spark is a good stack for this problem. Have a few
>>    questions:
>>    3. From a Spark substrate perspective, what are some of the optimum
>>    transformations & things to watch out for ?
>>    4. Is PairRDD the best data representation ? GroupByKey et al is only
>>    available for PairRDD.
>>    5. On a pragmatic level, file.map().map() results in RDD. How do I
>>    transform it to a PairRDD ?
>>       1. .map(fields => (fields(1), fields(0))) - results in Unit
>>       2. .map(fields => fields(1) -> fields(0)) also is not working
>>       3. Both these do not result in a PairRDD
>>       4. Am missing something fundamental.
>>
>> Cheers & Have a nice weekend
>> <k/>
>>
>
>
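
The key/value shape asked about in the quoted message (c_, a_ and combined c_-a_ keys, each paired with the d_ id) can be sketched with plain Scala collections; on an RDD of lines, the same function passed to `flatMap` would yield an RDD of tuples. The name `toPairs` and the `-` separator for combined keys are assumptions for illustration.

```scala
object DimensionPairsSketch {
  // Split one CSV line "d,c,a" into (dimensionKey, d) pairs:
  // one for c, one for a, and one for the combined c-a key.
  def toPairs(line: String): Seq[(String, String)] = {
    val fields = line.split(",").map(_.trim)
    val (d, c, a) = (fields(0), fields(1), fields(2))
    Seq((c, d), (a, d), (s"$c-$a", d))
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("d1,c1,a1", "d1,c1,a2", "d2,c1,a3")
    lines.flatMap(toPairs).foreach(println)
  }
}
```

Emitting all keys from one pass means a single grouping or aggregation step covers every dimension, at the cost of three output pairs per input row.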
