Great to hear you got a solution! Cheers!
Kevin
On Wed Jan 21 2015 at 11:13:44 AM jagaximo <takuya_seg...@dwango.co.jp> wrote:

> Kevin (Sangwoo) Kim wrote
> > If keys are not too many, you can do it like this:
> >
> > val data = List(
> >   ("A", Set(1,2,3)),
> >   ("A", Set(1,2,4)),
> >   ("B", Set(1,2,3))
> > )
> > val rdd = sc.parallelize(data)
> > rdd.persist()
> >
> > rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
> > rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
> > rdd.unpersist()
> >
> > ==
> > data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
> > rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
> > res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> > res334: Long = 4
> > res335: Long = 3
> > res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> >
> > Regards,
> > Kevin
>
> Wow, got it! Good solution.
> Fortunately, I know which keys have large Sets, so I was able to adopt this
> approach.
>
> Thanks!
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21275.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
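For readers without a cluster at hand, the logic of the snippet above can be checked locally: for each key, it counts the distinct elements across all of that key's Sets (`persist()` just keeps the RDD cached so the two `filter` passes don't recompute it; plain Scala collections are already materialized, so the sketch below omits it). This is a minimal Spark-free illustration; the object and method names (`DistinctPerKey`, `distinctCount`) are ours, not from the thread.

```scala
// Local sketch of the per-key distinct count from the Spark snippet:
// filter rows by key, flatten their Sets, and count distinct elements.
object DistinctPerKey {
  def distinctCount(data: List[(String, Set[Int])], key: String): Long =
    data.filter(_._1 == key).flatMap(_._2).distinct.size.toLong

  def main(args: Array[String]): Unit = {
    val data = List(
      ("A", Set(1, 2, 3)),
      ("A", Set(1, 2, 4)),
      ("B", Set(1, 2, 3))
    )
    // Matches the REPL results in the thread: res334: Long = 4, res335: Long = 3
    println(distinctCount(data, "A")) // 4
    println(distinctCount(data, "B")) // 3
  }
}
```

Note this filter-per-key approach scans the data once per key, which is why the thread hedges with "if keys are not too many"; for many keys, a single grouped aggregation would be preferable.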