Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-20 Thread jagaximo
Kevin (Sangwoo) Kim wrote: If there are not too many keys, you can do it like this:

  val data = List( ("A", Set(1,2,3)), ("A", Set(1,2,4)), ("B", Set(1,2,3)) )
  val rdd = sc.parallelize(data)
  rdd.persist()
  rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
  rdd.filter(_._1 ==

How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread jagaximo
I want to compute an RDD[(String, Set[String])] in which some of the Set[String] values are very large.

  val hoge: RDD[(String, Set[String])] = ...
  val reduced = hoge.reduceByKey(_ ++ _) // => creates a large Set (shuffle read size: 7 GB)
  val counted = reduced.map{ case (key, strSeq) =
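One way to avoid building the large merged Set per key is to flatten each Set into (key, string) pairs, de-duplicate the pairs, and then count pairs per key. The following is only a minimal sketch of that idea, not the approach proposed in the thread: the object name `DistinctCountPerKey`, the function name `distinctCountPerKey`, and the sample data are hypothetical, and it assumes the goal is the per-key unique count described elsewhere in this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed on Spark versions before 1.3
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object DistinctCountPerKey {

  // Counts the distinct strings per key without merging the Sets of one key
  // into a single in-memory collection.
  def distinctCountPerKey(hoge: RDD[(String, Set[String])]): RDD[(String, Long)] =
    hoge
      .flatMap { case (key, strs) => strs.map(s => (key, s)) } // one record per (key, string)
      .distinct()                                              // drop duplicate (key, string) pairs
      .map { case (key, _) => (key, 1L) }                      // each remaining pair counts once
      .reduceByKey(_ + _)                                      // sum the ones per key

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("distinct-count-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data, not taken from the original poster's job.
      val hoge = sc.parallelize(Seq(
        ("A", Set("1", "2", "3")),
        ("A", Set("1", "2", "4")),
        ("B", Set("1", "2", "3"))
      ))
      distinctCountPerKey(hoge).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

For the sample data this yields a count of 4 for key A and 3 for key B. The shuffle still moves every (key, string) pair, so whether this is cheaper than `reduceByKey(_ ++ _)` on the real data is something to measure rather than assume.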

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread jagaximo
What I want to do is get a unique count for each key. Using map() or countByKey() does not give a unique count (because duplicate strings are likely to be counted)...
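If an approximate unique count per key is acceptable, Spark's `countApproxDistinctByKey` on pair RDDs estimates it with a HyperLogLog-based sketch, so no per-key Set has to be materialized. This is a sketch under that assumption, not something suggested in the thread; the names `ApproxDistinctCountPerKey` and `approxDistinctCountPerKey` are hypothetical, and the default `relativeSD` of 0.05 is just an illustrative accuracy setting.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed on Spark versions before 1.3
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object ApproxDistinctCountPerKey {

  // Estimates the number of distinct strings per key.
  // relativeSD is the target relative standard deviation of the estimate;
  // smaller values are more accurate but use more memory per key.
  def approxDistinctCountPerKey(
      hoge: RDD[(String, Set[String])],
      relativeSD: Double = 0.05): RDD[(String, Long)] =
    hoge
      .flatMap { case (key, strs) => strs.map(s => (key, s)) } // one record per (key, string)
      .countApproxDistinctByKey(relativeSD)                    // approximate distinct values per key

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("approx-distinct-count-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: two overlapping Sets for key A.
      val hoge = sc.parallelize(Seq(
        ("A", (0 until 600).map(_.toString).toSet),
        ("A", (400 until 1000).map(_.toString).toSet)
      ))
      approxDistinctCountPerKey(hoge).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The result is an estimate, so it suits cases where a few percent of error is tolerable; when exact counts are required, de-duplicating the (key, string) pairs is the safer route.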