You can call collect() to pull the contents of an RDD into the driver:

  val badIPsLines = badIPs.collect()
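
A minimal sketch of the rest of the pattern, using the same names as the snippets below (the sample IPs here are hypothetical; the `sc.broadcast` step is Spark-specific and shown as a comment, while the Set construction and lookup are plain Scala):

```scala
// Stand-in for the lines returned by badIPs.collect();
// in a real job these come from the RDD on the driver.
val badIPsLines: Array[String] = Array("10.0.0.1", "192.168.1.5", "172.16.0.9")

// Build an immutable Set for fast membership checks.
val badIpSet: Set[String] = badIPsLines.toSet

// In the Spark job you would broadcast it once:
//   val badIPsBC = sc.broadcast(badIpSet)
// and then filter inside transformations via badIPsBC.value.
// The lookup itself is just:
val events = Seq("10.0.0.1", "8.8.8.8", "192.168.1.5")
val clean  = events.filterNot(badIpSet.contains)
```

Broadcasting the Set (rather than joining against an RDD each batch) ships it to every executor once, which is the usual approach for a small lookup table in a streaming job.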

On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg <jonrgr...@gmail.com> wrote:

> OK I tried that, but how do I convert an RDD to a Set that I can then
> broadcast and cache?
>
>       val badIPs = sc.textFile("hdfs:///user/jon/"+ "badfullIPs.csv")
>       val badIPsLines = badIPs.getLines
>       val badIpSet = badIPsLines.toSet
>       val badIPsBC = sc.broadcast(badIpSet)
>
> produces the error "value getLines is not a member of
> org.apache.spark.rdd.RDD[String]".
>
> Leaving it as an RDD and then constantly joining I think will be too slow
> for a streaming job.
>
> On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
>
>> Hi Jon,
>>
>> You'll need to put the file on HDFS (or whatever distributed filesystem
>> you're running on) and load it from there.
>>
>> -Sandy
>>
>> On Thu, Feb 5, 2015 at 3:18 PM, YaoPau <jonrgr...@gmail.com> wrote:
>>
>>> I have a file "badFullIPs.csv" of bad IP addresses used for filtering.
>>> In
>>> yarn-client mode, I simply read it off the edge node, transform it, and
>>> then
>>> broadcast it:
>>>
>>>       val badIPs = fromFile(edgeDir + "badfullIPs.csv")
>>>       val badIPsLines = badIPs.getLines
>>>       val badIpSet = badIPsLines.toSet
>>>       val badIPsBC = sc.broadcast(badIpSet)
>>>       badIPs.close
>>>
>>> How can I accomplish this in yarn-cluster mode?
>>>
>>> Jon
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-broadcast-a-variable-read-from-a-file-in-yarn-cluster-mode-tp21524.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
