Great to hear you got a solution! Cheers!
Kevin
On Wed Jan 21 2015 at 11:13:44 AM jagaximo <takuya_seg...@dwango.co.jp> wrote:

> Kevin (Sangwoo) Kim wrote
> > If keys are not too many, you can do it like this:
> >
> > val data = List(
> >   ("A", Set(1,2,3)),
> >   ("A", Set(1,2,4)),
> >   ("B", Set(1,2,3))
> > )
> > val rdd = sc.parallelize(data)
> > rdd.persist()
> >
> > rdd.filter(_._1 == "A").flatMap(_._2).distinct.count
> > rdd.filter(_._1 == "B").flatMap(_._2).distinct.count
> > rdd.unpersist()
> >
> > ==
> > data: List[(String, scala.collection.mutable.Set[Int])] = List((A,Set(1, 2, 3)), (A,Set(1, 2, 4)), (B,Set(1, 2, 3)))
> > rdd: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.Set[Int])] = ParallelCollectionRDD[6940] at parallelize at <console>:66
> > res332: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> > res334: Long = 4
> > res335: Long = 3
> > res336: rdd.type = ParallelCollectionRDD[6940] at parallelize at <console>:66
> >
> > Regards,
> > Kevin
>
> Wow, got it! Good solution.
> Fortunately, I know which keys have large Sets, so I was able to adopt this
> approach.
>
> Thanks!
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-large-Set-tp21248p21275.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
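For readers without a cluster at hand, the logic of the snippet above can be checked locally: for each key, it counts the distinct elements across all of that key's Sets (`persist()` just keeps the RDD cached so the two `filter` passes don't recompute it; plain Scala collections are already materialized, so the sketch below omits it). This is a minimal Spark-free illustration; the object and method names (`DistinctPerKey`, `distinctCount`) are ours, not from the thread.

```scala
// Local sketch of the per-key distinct count from the Spark snippet:
// filter rows by key, flatten their Sets, and count distinct elements.
object DistinctPerKey {
  def distinctCount(data: List[(String, Set[Int])], key: String): Long =
    data.filter(_._1 == key).flatMap(_._2).distinct.size.toLong

  def main(args: Array[String]): Unit = {
    val data = List(
      ("A", Set(1, 2, 3)),
      ("A", Set(1, 2, 4)),
      ("B", Set(1, 2, 3))
    )
    // Matches the REPL results in the thread: res334: Long = 4, res335: Long = 3
    println(distinctCount(data, "A")) // 4
    println(distinctCount(data, "B")) // 3
  }
}
```

Note this filter-per-key approach scans the data once per key, which is why the thread hedges with "if keys are not too many"; for many keys, a single grouped aggregation would be preferable.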