OK, that worked and I'm getting close here ... the job ran successfully for a bit and I got output for the first couple of buckets before hitting a "java.lang.Exception: Could not compute split, block input-0-1423593163000 not found" error.
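
For context, the input is a plain receiver-based Kafka DStream. Here's a rough sketch of how it's wired up (simplified; the ZooKeeper quorum, consumer group, and topic names below are placeholders rather than my real settings):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// ssc is the StreamingContext from the snippet quoted further down in this thread
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zkhost1:2181,zkhost2:2181",        // ZooKeeper quorum (placeholder)
  "sbStreamingTv-consumer",           // consumer group (placeholder)
  Map("weblogs" -> 1),                // topic -> number of receiver threads (placeholder)
  StorageLevel.MEMORY_AND_DISK_SER_2  // let received blocks spill to disk instead of being dropped
)

Since the blocks it can't find are receiver blocks (input-0-*), I'm wondering whether pinning the storage level to MEMORY_AND_DISK_SER_2 as above would let blocks spill to disk rather than get dropped under memory pressure, or whether something else is cleaning them up before the batch that needs them runs.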
So I bumped up the memory at the command line from 2 gb to 5 gb, ran it again ... this time I got around 8 successful outputs before erroring. Bumped up the memory from 5 gb to 10 gb ... got around 15 successful outputs before erroring. I'm not persisting or caching anything except for the broadcast IP table and another broadcast small user agents list used for the same type of filtering, and both files are tiny. The Hadoop cluster is nearly empty right now and has more than enough available memory to handle this job. I am connecting to Kafka as well and so there's a lot of data coming through as my index is trying to catch up to the current date, but yarn-client mode has several times in the past few weeks been able to catch up to the current date and run successfully for days without issue. My guess is memory isn't being cleared after each bucket? Relevant portion of the log below. 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593134400 on phd40010023.na.com:55551 in memory (size: 50.1 MB, free: 10.2 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593135400 on phd40010023.na.com:55551 in memory (size: 24.9 MB, free: 10.2 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593136400 on phd40010023.na.com:55551 in memory (size: 129.0 MB, free: 10.3 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593137000 on phd40010023.na.com:55551 in memory (size: 112.4 MB, free: 10.4 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593137200 on phd40010023.na.com:55551 in memory (size: 481.0 B, free: 10.4 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593137800 on phd40010023.na.com:55551 in memory (size: 44.6 MB, free: 10.5 GB) 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 754 from persistence list 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593138400 on phd40010023.na.com:55551 in memory (size: 95.8 MB, free: 10.6 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 754 15/02/10 13:34:54 INFO MapPartitionsRDD: Removing RDD 753 from persistence list 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593138600 on phd40010023.na.com:55551 in memory (size: 123.2 MB, free: 10.7 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593138800 on phd40010023.na.com:55551 in memory (size: 5.2 KB, free: 10.7 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 753 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 750 from persistence list 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593139600 on phd40010023.na.com:55551 in memory (size: 106.4 MB, free: 10.8 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593139800 on phd40010023.na.com:55551 in memory (size: 107.0 MB, free: 10.9 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593140000 on phd40010023.na.com:55551 in memory (size: 59.5 MB, free: 10.9 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 750 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 749 from persistence list 15/02/10 13:34:54 INFO DAGScheduler: Missing parents: List(Stage 117, Stage 114, Stage 115, Stage 116) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593140200 on phd40010023.na.com:55551 in memory (size: 845.0 B, free: 10.9 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593140600 on phd40010023.na.com:55551 in memory (size: 19.2 MB, free: 11.0 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593140800 on phd40010023.na.com:55551 in memory (size: 492.0 B, 
free: 11.0 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593141000 on phd40010023.na.com:55551 in memory (size: 1018.0 B, free: 11.0 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 749 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593142400 on phd40010023.na.com:55551 in memory (size: 48.6 MB, free: 11.0 GB) 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 767 from persistence list 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593142600 on phd40010023.na.com:55551 in memory (size: 4.9 KB, free: 11.0 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593142800 on phd40010023.na.com:55551 in memory (size: 780.0 B, free: 11.0 GB) 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593143800 on phd40010023.na.com:55551 in memory (size: 847.0 B, free: 11.0 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 767 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593144400 on phd40010023.na.com:55551 in memory (size: 43.7 MB, free: 11.1 GB) 15/02/10 13:34:54 INFO MapPartitionsRDD: Removing RDD 766 from persistence list 15/02/10 13:34:54 INFO BlockManager: Removing RDD 766 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 763 from persistence list 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593144600 on phd40010023.na.com:55551 in memory (size: 827.0 B, free: 11.1 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 763 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 762 from persistence list 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593144800 on phd40010023.na.com:55551 in memory (size: 1509.0 B, free: 11.1 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 762 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 761 from persistence list 15/02/10 13:34:54 INFO DAGScheduler: Submitting Stage 114 (MapPartitionsRDD[836] at combineByKey at ShuffledDStream.scala:42), which has no missing parents 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593147400 on phd40010023.na.com:55551 in memory (size: 94.2 MB, free: 11.1 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 761 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 760 from persistence list 15/02/10 13:34:54 INFO BlockManagerInfo: Removed input-0-1423593147600 on phd40010023.na.com:55551 in memory (size: 75.8 MB, free: 11.2 GB) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 760 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 759 from persistence list 15/02/10 13:34:54 INFO BlockManager: Removing RDD 759 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 758 from persistence list 15/02/10 13:34:54 INFO BlockManager: Removing RDD 758 15/02/10 13:34:54 INFO DAGScheduler: Submitting 10 missing tasks from Stage 114 (MapPartitionsRDD[836] at combineByKey at ShuffledDStream.scala:42) 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 757 from persistence list 15/02/10 13:34:54 INFO YarnClusterScheduler: Adding task set 114.0 with 10 tasks 15/02/10 13:34:54 INFO BlockManager: Removing RDD 757 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 756 from persistence list 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:0 as TID 2417 on executor 4: phd40010023.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 756 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:0 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:1 as TID 2418 on executor 8: phd40010007.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO DAGScheduler: Submitting Stage 115 
(MapPartitionsRDD[842] at combineByKey at ShuffledDStream.scala:42), which has no missing parents 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:1 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 783 from persistence list 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:2 as TID 2419 on executor 6: phd40010002.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:2 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:3 as TID 2420 on executor 7: phd40010003.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 783 15/02/10 13:34:54 INFO MapPartitionsRDD: Removing RDD 782 from persistence list 15/02/10 13:34:54 INFO DAGScheduler: Submitting 10 missing tasks from Stage 115 (MapPartitionsRDD[842] at combineByKey at ShuffledDStream.scala:42) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 782 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:3 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO YarnClusterScheduler: Adding task set 115.0 with 10 tasks 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:4 as TID 2421 on executor 1: phd40010027.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 779 from persistence list 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:4 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:5 as TID 2422 on executor 4: phd40010023.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 779 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:5 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:6 as TID 2423 on executor 8: phd40010007.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 778 from persistence list 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:6 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:7 as TID 2424 on executor 6: phd40010002.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO BlockManager: Removing RDD 778 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:7 as 2024 bytes in 1 ms 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 777 from persistence list 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:8 as TID 2425 on executor 7: phd40010003.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:8 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO BlockManager: Removing RDD 777 15/02/10 13:34:54 INFO TaskSetManager: Starting task 114.0:9 as TID 2426 on executor 1: phd40010027.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 114.0:9 as 2024 bytes in 0 ms 15/02/10 13:34:54 INFO FlatMappedRDD: Removing RDD 776 from persistence list 15/02/10 13:34:54 INFO BlockManager: Removing RDD 776 15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 775 from persistence list 15/02/10 13:34:54 INFO BlockManager: Removing RDD 775 15/02/10 13:34:54 INFO MappedRDD: Removing RDD 774 from persistence list 15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:0 as TID 2427 on executor 4: phd40010023.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:0 as 1910 bytes in 0 ms 15/02/10 13:34:54 INFO BlockManager: Removing RDD 774 15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:1 as TID 2428 on executor 6: phd40010002.na.com (PROCESS_LOCAL) 15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:1 as 1910 bytes in 0 ms 15/02/10 13:34:54 INFO 
MapPartitionsRDD: Removing RDD 773 from persistence list
15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:2 as TID 2429 on executor 7: phd40010003.na.com (PROCESS_LOCAL)
15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:2 as 1910 bytes in 0 ms
15/02/10 13:34:54 INFO BlockManager: Removing RDD 773
15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:3 as TID 2430 on executor 8: phd40010007.na.com (PROCESS_LOCAL)
15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:3 as 1910 bytes in 0 ms
15/02/10 13:34:54 INFO MappedRDD: Removing RDD 770 from persistence list
15/02/10 13:34:54 INFO BlockManager: Removing RDD 770
15/02/10 13:34:54 INFO FilteredRDD: Removing RDD 769 from persistence list
15/02/10 13:34:54 INFO BlockManager: Removing RDD 769
15/02/10 13:34:54 INFO DAGScheduler: Submitting Stage 116 (MapPartitionsRDD[855] at combineByKey at ShuffledDStream.scala:42), which has no missing parents
15/02/10 13:34:54 INFO DAGScheduler: Submitting 10 missing tasks from Stage 116 (MapPartitionsRDD[855] at combineByKey at ShuffledDStream.scala:42)
15/02/10 13:34:54 INFO YarnClusterScheduler: Adding task set 116.0 with 10 tasks
15/02/10 13:34:54 INFO DAGScheduler: Submitting Stage 118 (MapPartitionsRDD[862] at combineByKey at ShuffledDStream.scala:42), which has no missing parents
15/02/10 13:34:54 INFO DAGScheduler: Submitting 10 missing tasks from Stage 118 (MapPartitionsRDD[862] at combineByKey at ShuffledDStream.scala:42)
15/02/10 13:34:54 INFO YarnClusterScheduler: Adding task set 118.0 with 10 tasks
15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:4 as TID 2431 on executor 1: phd40010027.na.com (PROCESS_LOCAL)
15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:4 as 1910 bytes in 0 ms
15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:5 as TID 2432 on executor 1: phd40010027.na.com (PROCESS_LOCAL)
15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:5 as 1910 bytes in 0 ms
15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:6 as TID 2433 on executor 4: phd40010023.na.com (PROCESS_LOCAL)
15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:6 as 1910 bytes in 0 ms
15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:7 as TID 2434 on executor 4: phd40010023.na.com (PROCESS_LOCAL)
15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:7 as 1910 bytes in 0 ms
15/02/10 13:34:54 INFO TaskSetManager: Starting task 115.0:8 as TID 2435 on executor 4: phd40010023.na.com (PROCESS_LOCAL)
15/02/10 13:34:54 INFO TaskSetManager: Serialized task 115.0:8 as 1910 bytes in 0 ms
15/02/10 13:34:54 WARN TaskSetManager: Lost TID 2421 (task 114.0:4)
15/02/10 13:34:54 WARN TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block input-0-1423593163000 not found
    at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/02/10 13:34:54 WARN TaskSetManager: Lost TID 2426 (task 114.0:9)
15/02/10 13:34:54 WARN TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block input-0-1423593164000 not found
    at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

On Tue, Feb 10, 2015 at 1:07 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> You should be able to replace that second line with
>
> val sc = ssc.sparkContext
>
> On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg <jonrgr...@gmail.com> wrote:
>
>> They're separate in my code, how can I combine them? Here's what I have:
>>
>> val sparkConf = new SparkConf()
>> val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs))
>>
>> val sc = new SparkContext()
>>
>> On Tue, Feb 10, 2015 at 1:02 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>
>>> Is the SparkContext you're using the same one that the StreamingContext wraps? If not, I don't think using two is supported.
>>>
>>> -Sandy
>>>
>>> On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg <jonrgr...@gmail.com> wrote:
>>>
>>>> I'm still getting an error. Here's my code, which works successfully when tested using spark-shell:
>>>>
>>>> val badIPs = sc.textFile("/user/sb/badfullIPs.csv").collect
>>>> val badIpSet = badIPs.toSet
>>>> val badIPsBC = sc.broadcast(badIpSet)
>>>>
>>>> The job looks OK from my end:
>>>>
>>>> 15/02/07 18:59:58 INFO Client: Application report from ASM:
>>>> application identifier: application_1423081782629_3861
>>>> appId: 3861
>>>> * clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }*
>>>> appDiagnostics:
>>>> appMasterHost: phd40010008.na.com
>>>> appQueue: root.default
>>>> appMasterRpcPort: 0
>>>> appStartTime: 1423353581140
>>>> * yarnAppState: RUNNING*
>>>> distributedFinalState: UNDEFINED
>>>>
>>>> But the streaming process never actually begins. The full log is below, scroll to the end for the repeated warning "WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory".
>>>>
>>>> I'll note that I have a different Spark Streaming app called "dqd" working successfully for a different job that uses only a StreamingContext and not an additional SparkContext.
But this app (called "sbStreamingTv") >>>> uses both a SparkContext and a StreamingContext for grabbing a lookup file >>>> in HDFS for IP filtering. * The references to line #198 from the log >>>> below refers to the "val badIPs = >>>> sc.textFile("/user/sb/badfullIPs.csv").collect" line shown above, and it >>>> looks like Spark doesn't get beyond that point in the code.* >>>> >>>> Also, this job ("sbStreamingTv") does work successfully using >>>> yarn-client, even with both a SparkContext and StreamingContext. It looks >>>> to me that in yarn-cluster mode it's grabbing resources for the >>>> StreamingContext but not for the SparkContext. >>>> >>>> Any ideas? >>>> >>>> Jon >>>> >>>> >>>> 15/02/10 12:06:16 INFO MemoryStore: MemoryStore started with capacity >>>> 1177.8 MB. >>>> 15/02/10 12:06:16 INFO ConnectionManager: Bound socket to port 30129 >>>> with id = ConnectionManagerId(phd40010008.na.com,30129) >>>> 15/02/10 12:06:16 INFO BlockManagerMaster: Trying to register >>>> BlockManager >>>> 15/02/10 12:06:16 INFO BlockManagerInfo: Registering block manager >>>> phd40010008.na.com:30129 with 1177.8 MB RAM >>>> 15/02/10 12:06:16 INFO BlockManagerMaster: Registered BlockManager >>>> 15/02/10 12:06:16 INFO HttpServer: Starting HTTP Server >>>> 15/02/10 12:06:16 INFO HttpBroadcast: Broadcast server started at >>>> http://10.229.16.108:35183 >>>> 15/02/10 12:06:16 INFO HttpFileServer: HTTP File server directory is >>>> /hdata/12/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/container_1423081782629_7370_01_000001/tmp/spark-b73a964b-4d91-4af3-8246-48da420c1cec >>>> 15/02/10 12:06:16 INFO HttpServer: Starting HTTP Server >>>> 15/02/10 12:06:16 INFO JettyUtils: Adding filter: >>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter >>>> 15/02/10 12:06:16 INFO SparkUI: Started SparkUI at >>>> http://phd40010008.na.com:25869 >>>> 15/02/10 12:06:17 INFO EventLoggingListener: Logging events to >>>> /user/spark/applicationHistory/com.na.scalaspark.sbstreamingtv-1423587976801 >>>> 15/02/10 12:06:17 INFO YarnClusterScheduler: Created >>>> YarnClusterScheduler >>>> 15/02/10 12:06:17 INFO ApplicationMaster$$anon$1: Adding shutdown hook >>>> for context org.apache.spark.SparkContext@7f38095d >>>> 15/02/10 12:06:17 INFO ApplicationMaster: Registering the >>>> ApplicationMaster >>>> 15/02/10 12:06:17 INFO ApplicationMaster: Allocating 3 executors. >>>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Will Allocate 3 executor >>>> containers, each with 2432 memory >>>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Container request (host: >>>> Any, priority: 1, capability: <memory:2432, vCores:1> >>>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Container request (host: >>>> Any, priority: 1, capability: <memory:2432, vCores:1> >>>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Container request (host: >>>> Any, priority: 1, capability: <memory:2432, vCores:1> >>>> 15/02/10 12:06:20 INFO YarnClusterScheduler: >>>> YarnClusterScheduler.postStartHook done >>>> 15/02/10 12:06:20 WARN SparkConf: In Spark 1.0 and later >>>> spark.local.dir will be overridden by the value set by the cluster manager >>>> (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). 
>>>> 15/02/10 12:06:20 INFO SecurityManager: Changing view acls to: jg >>>> 15/02/10 12:06:20 INFO SecurityManager: SecurityManager: authentication >>>> disabled; ui acls disabled; users with view permissions: Set(jg) >>>> 15/02/10 12:06:20 INFO Slf4jLogger: Slf4jLogger started >>>> 15/02/10 12:06:20 INFO Remoting: Starting remoting >>>> 15/02/10 12:06:20 INFO Remoting: Remoting started; listening on >>>> addresses :[akka.tcp://sp...@phd40010008.na.com:43340] >>>> 15/02/10 12:06:20 INFO Remoting: Remoting now listens on addresses: >>>> [akka.tcp://sp...@phd40010008.na.com:43340] >>>> 15/02/10 12:06:20 INFO SparkEnv: Registering MapOutputTracker >>>> 15/02/10 12:06:20 INFO SparkEnv: Registering BlockManagerMaster >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/1/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-f6e1 >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/10/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-583d >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/11/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-0b66 >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/12/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-bc8f >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/2/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-17e4 >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/3/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-c01e >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/4/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-915c >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/5/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-38ff >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/6/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-c92f >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/7/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-b67a >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/8/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-46fb >>>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>>> /hdata/9/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-9d11 >>>> 15/02/10 12:06:20 INFO MemoryStore: MemoryStore started with capacity >>>> 1177.8 MB. 
>>>> 15/02/10 12:06:20 INFO ConnectionManager: Bound socket to port 55944 >>>> with id = ConnectionManagerId(phd40010008.na.com,55944) >>>> 15/02/10 12:06:20 INFO BlockManagerMaster: Trying to register >>>> BlockManager >>>> 15/02/10 12:06:20 INFO BlockManagerInfo: Registering block manager >>>> phd40010008.na.com:55944 with 1177.8 MB RAM >>>> 15/02/10 12:06:20 INFO BlockManagerMaster: Registered BlockManager >>>> 15/02/10 12:06:20 INFO HttpFileServer: HTTP File server directory is >>>> /hdata/12/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/container_1423081782629_7370_01_000001/tmp/spark-b3daba9d-f743-4738-b6c2-f56e56813edd >>>> 15/02/10 12:06:20 INFO HttpServer: Starting HTTP Server >>>> 15/02/10 12:06:20 INFO JettyUtils: Adding filter: >>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter >>>> 15/02/10 12:06:20 INFO SparkUI: Started SparkUI at >>>> http://phd40010008.na.com:10612 >>>> 15/02/10 12:06:20 INFO EventLoggingListener: Logging events to >>>> /user/spark/applicationHistory/com.na.scalaspark.sbstreamingtv-1423587980782 >>>> 15/02/10 12:06:20 INFO YarnClusterScheduler: Created >>>> YarnClusterScheduler >>>> 15/02/10 12:06:20 INFO YarnClusterScheduler: >>>> YarnClusterScheduler.postStartHook done >>>> 15/02/10 12:06:21 INFO MemoryStore: ensureFreeSpace(253715) called with >>>> curMem=0, maxMem=1235012812 >>>> 15/02/10 12:06:21 INFO MemoryStore: Block broadcast_0 stored as values >>>> to memory (estimated size 247.8 KB, free 1177.6 MB) >>>> 15/02/10 12:06:21 INFO FileInputFormat: Total input paths to process : 1 >>>> 15/02/10 12:06:21 INFO SparkContext: Starting job: collect at >>>> sbStreamingTv.scala:198 >>>> 15/02/10 12:06:21 INFO DAGScheduler: Got job 0 (collect at >>>> sbStreamingTv.scala:198) with 2 output partitions (allowLocal=false) >>>> 15/02/10 12:06:21 INFO DAGScheduler: Final stage: Stage 0(*collect at >>>> sbStreamingTv.scala:198*) >>>> 15/02/10 12:06:21 INFO DAGScheduler: Parents of final stage: List() >>>> 15/02/10 12:06:21 INFO DAGScheduler: Missing parents: List() >>>> 15/02/10 12:06:21 INFO DAGScheduler: Submitting Stage 0 (*MappedRDD[1] >>>> at textFile at sbStreamingTv.scala:198*), which has no missing parents >>>> 15/02/10 12:06:21 INFO DAGScheduler: Submitting 2 missing tasks from >>>> Stage 0 (*MappedRDD[1] at textFile at sbStreamingTv.scala:198*) >>>> 15/02/10 12:06:21 INFO YarnClusterScheduler: Adding task set 0.0 with 2 >>>> tasks >>>> 15/02/10 12:06:21 INFO AMRMClientImpl: Received new token for : >>>> phd40010024.na.com:8041 >>>> 15/02/10 12:06:21 INFO AMRMClientImpl: Received new token for : >>>> phd40010002.na.com:8041 >>>> 15/02/10 12:06:21 INFO AMRMClientImpl: Received new token for : >>>> phd40010022.na.com:8041 >>>> 15/02/10 12:06:21 INFO RackResolver: Resolved phd40010002.na.com to >>>> /sdc/c4h5 >>>> 15/02/10 12:06:21 INFO RackResolver: Resolved phd40010022.na.com to >>>> /sdc/c4h5 >>>> 15/02/10 12:06:21 INFO RackResolver: Resolved phd40010024.na.com to >>>> /sdc/c4h1 >>>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching container >>>> container_1423081782629_7370_01_000003 for on host phd40010002.na.com >>>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching >>>> ExecutorRunnable. 
driverUrl: akka.tcp:// >>>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, >>>> executorHostname: phd40010002.na.com >>>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching container >>>> container_1423081782629_7370_01_000004 for on host phd40010022.na.com >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Starting Executor Container >>>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching >>>> ExecutorRunnable. driverUrl: akka.tcp:// >>>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, >>>> executorHostname: phd40010022.na.com >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Starting Executor Container >>>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching container >>>> container_1423081782629_7370_01_000002 for on host phd40010024.na.com >>>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching >>>> ExecutorRunnable. driverUrl: akka.tcp:// >>>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, >>>> executorHostname: phd40010024.na.com >>>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: >>>> yarn.client.max-nodemanagers-proxies : 500 >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Starting Executor Container >>>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: >>>> yarn.client.max-nodemanagers-proxies : 500 >>>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: >>>> yarn.client.max-nodemanagers-proxies : 500 >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up >>>> ContainerLaunchContext >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up >>>> ContainerLaunchContext >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up >>>> ContainerLaunchContext >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Preparing Local resources >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Preparing Local resources >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Preparing Local resources >>>> 15/02/10 12:06:21 INFO ApplicationMaster: All executors have launched. 
>>>> 15/02/10 12:06:21 INFO ApplicationMaster: Started progress reporter >>>> thread - sleep time : 5000 >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Prepared Local resources >>>> Map(__spark__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: >>>> -1 file: >>>> "/user/jg/.sparkStaging/application_1423081782629_7370/spark-assembly-1.0.0-cdh5.1.3-hadoop2.3.0-cdh5.1.3.jar" >>>> } size: 93542713 timestamp: 1423587960750 type: FILE visibility: PRIVATE, >>>> __app__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: -1 >>>> file: >>>> "/user/jg/.sparkStaging/application_1423081782629_7370/sbStreamingTv-0.0.1-SNAPSHOT-jar-with-dependencies.jar" >>>> } size: 95950353 timestamp: 1423587960370 type: FILE visibility: PRIVATE) >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Prepared Local resources >>>> Map(__spark__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: >>>> -1 file: >>>> "/user/jg/.sparkStaging/application_1423081782629_7370/spark-assembly-1.0.0-cdh5.1.3-hadoop2.3.0-cdh5.1.3.jar" >>>> } size: 93542713 timestamp: 1423587960750 type: FILE visibility: PRIVATE, >>>> __app__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: -1 >>>> file: >>>> "/user/jg/.sparkStaging/application_1423081782629_7370/sbStreamingTv-0.0.1-SNAPSHOT-jar-with-dependencies.jar" >>>> } size: 95950353 timestamp: 1423587960370 type: FILE visibility: PRIVATE) >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Prepared Local resources >>>> Map(__spark__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: >>>> -1 file: >>>> "/user/jg/.sparkStaging/application_1423081782629_7370/spark-assembly-1.0.0-cdh5.1.3-hadoop2.3.0-cdh5.1.3.jar" >>>> } size: 93542713 timestamp: 1423587960750 type: FILE visibility: PRIVATE, >>>> __app__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: -1 >>>> file: >>>> "/user/jg/.sparkStaging/application_1423081782629_7370/sbStreamingTv-0.0.1-SNAPSHOT-jar-with-dependencies.jar" >>>> } size: 95950353 timestamp: 1423587960370 type: FILE visibility: PRIVATE) >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up executor with >>>> commands: List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill >>>> %p', -Xms2048m -Xmx2048m , -Djava.io.tmpdir=$PWD/tmp, >>>> -Dlog4j.configuration=log4j-spark-container.properties, >>>> org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp:// >>>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, 1, >>>> phd40010002.na.com, 1, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr) >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up executor with >>>> commands: List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill >>>> %p', -Xms2048m -Xmx2048m , -Djava.io.tmpdir=$PWD/tmp, >>>> -Dlog4j.configuration=log4j-spark-container.properties, >>>> org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp:// >>>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, 3, >>>> phd40010024.na.com, 1, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr) >>>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up executor with >>>> commands: List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill >>>> %p', -Xms2048m -Xmx2048m , -Djava.io.tmpdir=$PWD/tmp, >>>> -Dlog4j.configuration=log4j-spark-container.properties, >>>> org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp:// >>>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, 2, >>>> phd40010022.na.com, 1, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr) >>>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: Opening proxy 
>>>> : phd40010022.na.com:8041 >>>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: Opening proxy >>>> : phd40010024.na.com:8041 >>>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: Opening proxy >>>> : phd40010002.na.com:8041 >>>> 15/02/10 12:06:26 INFO CoarseGrainedSchedulerBackend: Registered >>>> executor: Actor[akka.tcp:// >>>> sparkexecu...@phd40010022.na.com:29369/user/Executor#43651774] with ID >>>> 2 >>>> 15/02/10 12:06:26 INFO CoarseGrainedSchedulerBackend: Registered >>>> executor: Actor[akka.tcp:// >>>> sparkexecu...@phd40010024.na.com:12969/user/Executor#1711844295] with >>>> ID 3 >>>> 15/02/10 12:06:26 INFO BlockManagerInfo: Registering block manager >>>> phd40010022.na.com:14119 with 1178.1 MB RAM >>>> 15/02/10 12:06:26 INFO BlockManagerInfo: Registering block manager >>>> phd40010024.na.com:53284 with 1178.1 MB RAM >>>> 15/02/10 12:06:29 INFO CoarseGrainedSchedulerBackend: Registered >>>> executor: Actor[akka.tcp:// >>>> sparkexecu...@phd40010002.na.com:35547/user/Executor#-1690254909] with >>>> ID 1 >>>> 15/02/10 12:06:29 INFO BlockManagerInfo: Registering block manager >>>> phd40010002.na.com:62754 with 1178.1 MB RAM >>>> 15/02/10 12:06:36 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:06:51 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:07:06 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:07:21 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:07:36 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:07:51 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:08:06 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:08:21 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:08:36 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:08:51 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:09:06 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:09:21 WARN YarnClusterScheduler: Initial job has not >>>> accepted any resources; check your cluster UI to ensure that workers are >>>> registered and have sufficient memory >>>> 15/02/10 12:09:36 WARN YarnClusterScheduler: Initial job has not >>>> accepted any 
>>>> resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>> 15/02/10 12:09:51 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>> 15/02/10 12:10:06 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>> 15/02/10 12:10:21 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>> 15/02/10 12:10:36 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>> 15/02/10 12:10:51 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>>
>>>> On Fri, Feb 6, 2015 at 3:24 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>
>>>>> You can call collect() to pull in the contents of an RDD into the driver:
>>>>>
>>>>> val badIPsLines = badIPs.collect()
>>>>>
>>>>> On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg <jonrgr...@gmail.com> wrote:
>>>>>
>>>>>> OK I tried that, but how do I convert an RDD to a Set that I can then broadcast and cache?
>>>>>>
>>>>>> val badIPs = sc.textFile("hdfs:///user/jon/"+ "badfullIPs.csv")
>>>>>> val badIPsLines = badIPs.getLines
>>>>>> val badIpSet = badIPsLines.toSet
>>>>>> val badIPsBC = sc.broadcast(badIpSet)
>>>>>>
>>>>>> produces the error "value getLines is not a member of org.apache.spark.rdd.RDD[String]".
>>>>>>
>>>>>> Leaving it as an RDD and then constantly joining I think will be too slow for a streaming job.
>>>>>>
>>>>>> On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>>>
>>>>>>> Hi Jon,
>>>>>>>
>>>>>>> You'll need to put the file on HDFS (or whatever distributed filesystem you're running on) and load it from there.
>>>>>>>
>>>>>>> -Sandy
>>>>>>>
>>>>>>> On Thu, Feb 5, 2015 at 3:18 PM, YaoPau <jonrgr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have a file "badFullIPs.csv" of bad IP addresses used for filtering. In yarn-client mode, I simply read it off the edge node, transform it, and then broadcast it:
>>>>>>>>
>>>>>>>> val badIPs = fromFile(edgeDir + "badfullIPs.csv")
>>>>>>>> val badIPsLines = badIPs.getLines
>>>>>>>> val badIpSet = badIPsLines.toSet
>>>>>>>> val badIPsBC = sc.broadcast(badIpSet)
>>>>>>>> badIPs.close
>>>>>>>>
>>>>>>>> How can I accomplish this in yarn-cluster mode?
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-broadcast-a-variable-read-from-a-file-in-yarn-cluster-mode-tp21524.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>
>>>>>>> >>>>>> >>>>> >>>> >>> >> >
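
P.S. For completeness, after Sandy's suggestion the context and broadcast setup that's now running looks roughly like this (condensed from the snippets quoted above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs))  // bucketSecs as before
val sc = ssc.sparkContext  // reuse the StreamingContext's SparkContext instead of creating a second one

val badIpSet = sc.textFile("/user/sb/badfullIPs.csv").collect().toSet  // small IP lookup file in HDFS
val badIPsBC = sc.broadcast(badIpSet)  // broadcast the Set once and reuse it for filtering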