I want to compute an RDD[(String, Set[String])] in which some of the Set[String] values are very large.
--------------
val hoge: RDD[(String, Set[String])] = ...
val reduced = hoge.reduceByKey(_ ++ _) // <= builds very large Sets (shuffle read size 7GB)
val counted = reduced.map { case (key, strSeq) => s"$key\t${strSeq.size}" }
counted.saveAsTextFile("/path/to/save/dir")
----------

Looking at the Spark UI, executors are lost during the saveAsTextFile stage and the stage is resubmitted; after that, Spark keeps losing executors.

As a workaround I thought of building an RDD[(String, RDD[String])], unioning the inner RDD[String]s, and doing a distinct count, but creating an RDD inside another RDD throws a NullPointerException, so that operation is probably not possible.

What might be causing this, and what would be a better approach? Please lend me your wisdom.
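(To clarify the kind of answer I am hoping for, here is a rough, untested sketch of an alternative I am considering: flatten each Set into (key, string) pairs, deduplicate the pairs with distinct(), and then count per key, so that no single record ever holds a multi-gigabyte Set. `hoge` is the same RDD as above; ideally the flattening would happen even before the Sets are built.)

--------------
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark versions

val hoge: RDD[(String, Set[String])] = ...

// Flatten each Set into individual (key, string) pairs so every record stays small.
val pairs: RDD[(String, String)] =
  hoge.flatMap { case (key, strs) => strs.iterator.map(s => (key, s)) }

// distinct() deduplicates the pairs across the cluster, then count per key.
val counts: RDD[String] =
  pairs.distinct()
       .mapValues(_ => 1L)
       .reduceByKey(_ + _)
       .map { case (key, n) => s"$key\t$n" }

counts.saveAsTextFile("/path/to/save/dir")
----------

If an approximate count per key were acceptable, countApproxDistinctByKey (HyperLogLog-based) might also avoid the huge shuffle, though I have not tried it. Does this direction make sense, or is there a better way?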