Ian,

Yep, HLL is an appropriate mechanism. countApproxDistinctByKey is a wrapper
around com.clearspring.analytics.stream.cardinality.HyperLogLogPlus.

Cheers
<k/>
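For reference, a minimal sketch of what that call looks like. The sample data,
variable names, and the 0.05 relative accuracy are illustrative, not from the
thread; it assumes a live SparkContext named sc:

    import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

    // Hypothetical (hour, user) records, with one duplicate in the first hour.
    val pairs = sc.parallelize(Seq(
      ("2014-06-15T00", "user-1"),
      ("2014-06-15T00", "user-2"),
      ("2014-06-15T00", "user-1"),
      ("2014-06-15T01", "user-3")))

    // relativeSD = 0.05: looser accuracy costs less memory per HLL counter.
    val approxCounts = pairs.countApproxDistinctByKey(0.05)
    approxCounts.collect().foreach(println)  // e.g. (2014-06-15T00,2) (2014-06-15T01,1)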
On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell <i...@ianoconnell.com> wrote:

> Depending on your requirements, a much more scalable method for
> calculating distinct cardinality on hourly metrics would be to use a
> HyperLogLog data structure.
> A Scala implementation people have used with Spark is
> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala
>
>
> On Sun, Jun 15, 2014 at 6:16 AM, Surendranauth Hiraman
> <suren.hira...@velos.io> wrote:
>
>> Vivek,
>>
>> If the foldByKey solution doesn't work for you, my team uses
>> RDD.persist(DISK_ONLY) to avoid OOM errors.
>>
>> It's slower, of course, and requires tuning other config parameters. It
>> can also be a problem if you do not have enough disk space, meaning that
>> you have to unpersist at the right points if you are running long flows.
>>
>> For us, even though the disk writes are a performance hit, we prefer the
>> Spark programming model to Hadoop M/R. But we are still working on
>> getting this to work end to end on 100s of GB of data on our 16-node
>> cluster.
>>
>> Suren
>>
>>
>> On Sun, Jun 15, 2014 at 12:08 AM, Vivek YS <vivek...@gmail.com> wrote:
>>
>>> Thanks for the input. I will give foldByKey a shot.
>>>
>>> The way I am doing it is: the data is partitioned hourly, so I compute
>>> distinct values hourly. Then I use UnionRDD to merge the hours and
>>> compute distinct over the overall data.
>>>
>>> > Is there a way to know which (key, value) pair is resulting in the
>>> > OOM?
>>> > Is there a way to set parallelism in the map stage so that each
>>> > worker will process one key at a time?
>>>
>>> I didn't realise countApproxDistinctByKey uses HyperLogLogPlus. This
>>> should be interesting.
>>>
>>> --Vivek
>>>
>>>
>>> On Sat, Jun 14, 2014 at 11:37 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Grouping by key is always problematic, since a key might have a huge
>>>> number of values. You can do a little better than grouping *all*
>>>> values and *then* finding distinct values by using foldByKey, putting
>>>> values into a Set. At least you end up with only distinct values in
>>>> memory. (You don't need two maps either, right?)
>>>>
>>>> If the number of distinct values is still huge for some keys, consider
>>>> the experimental method countApproxDistinctByKey:
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L285
>>>>
>>>> This should be much more performant, at the cost of some accuracy.
>>>>
>>>>
>>>> On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS <vivek...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> For the last couple of days I have been trying hard to get around
>>>>> this problem. Please share any insights on solving it.
>>>>>
>>>>> Problem:
>>>>> There is a huge list of (key, value) pairs. I want to transform this
>>>>> to (key, distinct values) and then eventually to (key, distinct
>>>>> values count).
>>>>>
>>>>> On a small dataset:
>>>>>
>>>>> groupByKey().map(x => (x._1, x._2.distinct)) ... map(x => (x._1,
>>>>> x._2.distinct.size))
>>>>>
>>>>> On a large dataset I am getting OOM.
>>>>>
>>>>> Is there a way to represent the Seq of values from groupByKey as an
>>>>> RDD and then perform distinct over it?
>>>>>
>>>>> Thanks
>>>>> Vivek
>>
>>
>> --
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hira...@velos.io
>> W: www.velos.io
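For what it's worth, a minimal sketch of the foldByKey approach Sean describes
above: wrap each value in a one-element Set and fold the Sets together, so each
key only ever holds its distinct values. Sample data and names are illustrative;
it assumes a live SparkContext named sc:

    import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

    val pairs = sc.parallelize(Seq(
      ("k1", "a"), ("k1", "a"), ("k1", "b"), ("k2", "a")))

    // Duplicates disappear as the Sets are merged, so memory per key is
    // bounded by the number of *distinct* values, not all values.
    val distinctValues = pairs
      .mapValues(v => Set(v))
      .foldByKey(Set.empty[String])(_ ++ _)

    val distinctCounts = distinctValues.mapValues(_.size)
    distinctCounts.collect().foreach(println)  // (k1,2), (k2,1)

Because the empty Set and set union form a monoid, Spark can merge partial
Sets map-side before the shuffle, which is what makes this cheaper than
groupByKey followed by distinct.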
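And a minimal sketch of the Algebird HyperLogLog route Ian suggests: reduce
each key's values to a small, mergeable HLL sketch, so hourly results can
later be combined without revisiting the raw values. The 12-bit precision and
names are illustrative; it assumes algebird-core on the classpath and reuses
the pairs RDD and import from the previous sketch:

    import com.twitter.algebird.{HLL, HyperLogLogMonoid}

    val hllMonoid = new HyperLogLogMonoid(12)  // error ~ 1.04/sqrt(2^12), about 1.6%

    // One sketch per key. Sketches merge associatively, so per-hour RDDs of
    // (key, HLL) can themselves be unioned and reduced again later.
    val sketches = pairs
      .mapValues(v => hllMonoid.create(v.getBytes("UTF-8")))
      .reduceByKey(hllMonoid.plus(_, _))

    // Approximate distinct count per key, read off each sketch.
    val approxDistinctCounts = sketches.mapValues(_.estimatedSize.toLong)

Merging sketches instead of raw values is what makes the hourly-partition
scheme above scale: the union of two hours is just the pointwise merge of
their sketches, a few KB per key regardless of how many values were seen.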