You can call collect() to pull the contents of an RDD into the driver:

    val badIPsLines = badIPs.collect()
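Putting the pieces of this thread together, a minimal sketch of the yarn-cluster-friendly version might look like the following. It assumes `sc` is an existing SparkContext and reuses the path and variable names from the snippets below; note that `collect()` returns an `Array[String]` (there is no `getLines` on an RDD), which can then be turned into a Set and broadcast:

```scala
import org.apache.spark.SparkContext

// Sketch only: read the blacklist from HDFS so it resolves in yarn-cluster
// mode, pull it to the driver with collect() (safe here because the file is
// small), and broadcast the resulting Set to the executors.
def broadcastBadIps(sc: SparkContext) = {
  val badIPs = sc.textFile("hdfs:///user/jon/badfullIPs.csv")
  val badIpSet = badIPs.collect().toSet   // Array[String] -> Set[String]
  sc.broadcast(badIpSet)
}
```

Executors can then filter against `badIpSet` cheaply via the broadcast's `.value`, avoiding the repeated join the streaming job was worried about.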
On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg <jonrgr...@gmail.com> wrote:

> OK I tried that, but how do I convert an RDD to a Set that I can then
> broadcast and cache?
>
>     val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")
>     val badIPsLines = badIPs.getLines
>     val badIpSet = badIPsLines.toSet
>     val badIPsBC = sc.broadcast(badIpSet)
>
> produces the error "value getLines is not a member of
> org.apache.spark.rdd.RDD[String]".
>
> Leaving it as an RDD and then constantly joining I think will be too slow
> for a streaming job.
>
> On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Hi Jon,
>>
>> You'll need to put the file on HDFS (or whatever distributed filesystem
>> you're running on) and load it from there.
>>
>> -Sandy
>>
>> On Thu, Feb 5, 2015 at 3:18 PM, YaoPau <jonrgr...@gmail.com> wrote:
>>
>>> I have a file "badFullIPs.csv" of bad IP addresses used for filtering.
>>> In yarn-client mode, I simply read it off the edge node, transform it,
>>> and then broadcast it:
>>>
>>>     val badIPs = fromFile(edgeDir + "badfullIPs.csv")
>>>     val badIPsLines = badIPs.getLines
>>>     val badIpSet = badIPsLines.toSet
>>>     val badIPsBC = sc.broadcast(badIpSet)
>>>     badIPs.close
>>>
>>> How can I accomplish this in yarn-cluster mode?
>>>
>>> Jon
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-broadcast-a-variable-read-from-a-file-in-yarn-cluster-mode-tp21524.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.