I want to compute an RDD[(String, Set[String])] in which some of the Set[String] values are very large.
--------------
val hoge: RDD[(String, Set[String])] = ...
val reduced = hoge.reduceByKey(_ ++ _) // <= builds very large Sets (shuffle read size 7GB)
val counted = reduced.map { case (key, strSeq) => s"$key\t${strSeq.size}" }
counted.saveAsTextFile("/path/to/save/dir")
----------

Looking at the Spark UI, executors are lost during the saveAsTextFile stage and the stage is resubmitted; after that, Spark keeps losing executors.

As a workaround I thought of building an RDD[(String, RDD[String])], unioning the inner RDD[String]s, and doing a distinct count, but creating an RDD inside another RDD throws a NullPointerException, so that operation is probably not possible.

What might be causing this, and what would be a better approach? Please lend me your wisdom.
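(To clarify the kind of answer I am hoping for, here is a rough, untested sketch of an alternative I am considering: flatten each Set into (key, string) pairs, deduplicate the pairs with distinct(), and then count per key, so that no single record ever holds a multi-gigabyte Set. `hoge` is the same RDD as above; ideally the flattening would happen even before the Sets are built.)

--------------
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

val hoge: RDD[(String, Set[String])] = ...

// Flatten each Set into individual (key, string) pairs so every record stays small.
val pairs: RDD[(String, String)] =
  hoge.flatMap { case (key, strs) => strs.iterator.map(s => (key, s)) }

// distinct() deduplicates the pairs across the cluster, then count per key.
val counts: RDD[String] =
  pairs.distinct()
       .mapValues(_ => 1L)
       .reduceByKey(_ + _)
       .map { case (key, n) => s"$key\t$n" }

counts.saveAsTextFile("/path/to/save/dir")
----------

If an approximate count per key were acceptable, countApproxDistinctByKey (HyperLogLog-based) might also avoid the huge shuffle, though I have not tried it. Does this direction make sense, or is there a better way?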