You should be able to replace that second line with val sc = ssc.sparkContext
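For example, a minimal untested sketch of what I mean (bucketSecs and the HDFS path are taken from your snippets below):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf()
    val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs))

    // Reuse the SparkContext the StreamingContext already wraps, rather than
    // constructing a second one in the same application.
    val sc = ssc.sparkContext

    val badIPs = sc.textFile("/user/sb/badfullIPs.csv").collect()
    val badIpSet = badIPs.toSet
    val badIPsBC = sc.broadcast(badIpSet)

That keeps the broadcast and the streaming job on a single context, so there's no second SparkContext sitting there waiting for executors.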
On Tue, Feb 10, 2015 at 10:04 AM, Jon Gregg <jonrgr...@gmail.com> wrote:

> They're separate in my code; how can I combine them? Here's what I have:
>
> val sparkConf = new SparkConf()
> val ssc = new StreamingContext(sparkConf, Seconds(bucketSecs))
>
> val sc = new SparkContext()
>
> On Tue, Feb 10, 2015 at 1:02 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Is the SparkContext you're using the same one that the StreamingContext
>> wraps? If not, I don't think using two is supported.
>>
>> -Sandy
>>
>> On Tue, Feb 10, 2015 at 9:58 AM, Jon Gregg <jonrgr...@gmail.com> wrote:
>>
>>> I'm still getting an error. Here's my code, which works successfully
>>> when tested using spark-shell:
>>>
>>> val badIPs = sc.textFile("/user/sb/badfullIPs.csv").collect
>>> val badIpSet = badIPs.toSet
>>> val badIPsBC = sc.broadcast(badIpSet)
>>>
>>> The job looks OK from my end:
>>>
>>> 15/02/07 18:59:58 INFO Client: Application report from ASM:
>>> application identifier: application_1423081782629_3861
>>> appId: 3861
>>> * clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }*
>>> appDiagnostics:
>>> appMasterHost: phd40010008.na.com
>>> appQueue: root.default
>>> appMasterRpcPort: 0
>>> appStartTime: 1423353581140
>>> * yarnAppState: RUNNING*
>>> distributedFinalState: UNDEFINED
>>>
>>> But the streaming process never actually begins. The full log is below;
>>> scroll to the end for the repeated warning "WARN YarnClusterScheduler:
>>> Initial job has not accepted any resources; check your cluster UI to
>>> ensure that workers are registered and have sufficient memory".
>>>
>>> I'll note that I have a different Spark Streaming app called "dqd"
>>> working successfully for a different job that uses only a StreamingContext
>>> and not an additional SparkContext. But this app (called "sbStreamingTv")
>>> uses both a SparkContext and a StreamingContext to grab a lookup file in
>>> HDFS for IP filtering. *The references to line #198 in the log below
>>> refer to the "val badIPs = sc.textFile("/user/sb/badfullIPs.csv").collect"
>>> line shown above, and it looks like Spark doesn't get beyond that point in
>>> the code.*
>>>
>>> Also, this job ("sbStreamingTv") does work successfully using
>>> yarn-client, even with both a SparkContext and a StreamingContext. It
>>> looks to me like in yarn-cluster mode it's grabbing resources for the
>>> StreamingContext but not for the SparkContext.
>>>
>>> Any ideas?
>>>
>>> Jon
>>>
>>> 15/02/10 12:06:16 INFO MemoryStore: MemoryStore started with capacity 1177.8 MB.
>>> 15/02/10 12:06:16 INFO ConnectionManager: Bound socket to port 30129 >>> with id = ConnectionManagerId(phd40010008.na.com,30129) >>> 15/02/10 12:06:16 INFO BlockManagerMaster: Trying to register >>> BlockManager >>> 15/02/10 12:06:16 INFO BlockManagerInfo: Registering block manager >>> phd40010008.na.com:30129 with 1177.8 MB RAM >>> 15/02/10 12:06:16 INFO BlockManagerMaster: Registered BlockManager >>> 15/02/10 12:06:16 INFO HttpServer: Starting HTTP Server >>> 15/02/10 12:06:16 INFO HttpBroadcast: Broadcast server started at >>> http://10.229.16.108:35183 >>> 15/02/10 12:06:16 INFO HttpFileServer: HTTP File server directory is >>> /hdata/12/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/container_1423081782629_7370_01_000001/tmp/spark-b73a964b-4d91-4af3-8246-48da420c1cec >>> 15/02/10 12:06:16 INFO HttpServer: Starting HTTP Server >>> 15/02/10 12:06:16 INFO JettyUtils: Adding filter: >>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter >>> 15/02/10 12:06:16 INFO SparkUI: Started SparkUI at >>> http://phd40010008.na.com:25869 >>> 15/02/10 12:06:17 INFO EventLoggingListener: Logging events to >>> /user/spark/applicationHistory/com.na.scalaspark.sbstreamingtv-1423587976801 >>> 15/02/10 12:06:17 INFO YarnClusterScheduler: Created YarnClusterScheduler >>> 15/02/10 12:06:17 INFO ApplicationMaster$$anon$1: Adding shutdown hook >>> for context org.apache.spark.SparkContext@7f38095d >>> 15/02/10 12:06:17 INFO ApplicationMaster: Registering the >>> ApplicationMaster >>> 15/02/10 12:06:17 INFO ApplicationMaster: Allocating 3 executors. >>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Will Allocate 3 executor >>> containers, each with 2432 memory >>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Container request (host: >>> Any, priority: 1, capability: <memory:2432, vCores:1> >>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Container request (host: >>> Any, priority: 1, capability: <memory:2432, vCores:1> >>> 15/02/10 12:06:17 INFO YarnAllocationHandler: Container request (host: >>> Any, priority: 1, capability: <memory:2432, vCores:1> >>> 15/02/10 12:06:20 INFO YarnClusterScheduler: >>> YarnClusterScheduler.postStartHook done >>> 15/02/10 12:06:20 WARN SparkConf: In Spark 1.0 and later spark.local.dir >>> will be overridden by the value set by the cluster manager (via >>> SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). 
>>> 15/02/10 12:06:20 INFO SecurityManager: Changing view acls to: jg >>> 15/02/10 12:06:20 INFO SecurityManager: SecurityManager: authentication >>> disabled; ui acls disabled; users with view permissions: Set(jg) >>> 15/02/10 12:06:20 INFO Slf4jLogger: Slf4jLogger started >>> 15/02/10 12:06:20 INFO Remoting: Starting remoting >>> 15/02/10 12:06:20 INFO Remoting: Remoting started; listening on >>> addresses :[akka.tcp://sp...@phd40010008.na.com:43340] >>> 15/02/10 12:06:20 INFO Remoting: Remoting now listens on addresses: >>> [akka.tcp://sp...@phd40010008.na.com:43340] >>> 15/02/10 12:06:20 INFO SparkEnv: Registering MapOutputTracker >>> 15/02/10 12:06:20 INFO SparkEnv: Registering BlockManagerMaster >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/1/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-f6e1 >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/10/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-583d >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/11/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-0b66 >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/12/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-bc8f >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/2/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-17e4 >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/3/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-c01e >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/4/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-915c >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/5/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-38ff >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/6/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-c92f >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/7/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-b67a >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/8/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-46fb >>> 15/02/10 12:06:20 INFO DiskBlockManager: Created local directory at >>> /hdata/9/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/spark-local-20150210120620-9d11 >>> 15/02/10 12:06:20 INFO MemoryStore: MemoryStore started with capacity >>> 1177.8 MB. 
>>> 15/02/10 12:06:20 INFO ConnectionManager: Bound socket to port 55944 >>> with id = ConnectionManagerId(phd40010008.na.com,55944) >>> 15/02/10 12:06:20 INFO BlockManagerMaster: Trying to register >>> BlockManager >>> 15/02/10 12:06:20 INFO BlockManagerInfo: Registering block manager >>> phd40010008.na.com:55944 with 1177.8 MB RAM >>> 15/02/10 12:06:20 INFO BlockManagerMaster: Registered BlockManager >>> 15/02/10 12:06:20 INFO HttpFileServer: HTTP File server directory is >>> /hdata/12/yarn/nm/usercache/jg/appcache/application_1423081782629_7370/container_1423081782629_7370_01_000001/tmp/spark-b3daba9d-f743-4738-b6c2-f56e56813edd >>> 15/02/10 12:06:20 INFO HttpServer: Starting HTTP Server >>> 15/02/10 12:06:20 INFO JettyUtils: Adding filter: >>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter >>> 15/02/10 12:06:20 INFO SparkUI: Started SparkUI at >>> http://phd40010008.na.com:10612 >>> 15/02/10 12:06:20 INFO EventLoggingListener: Logging events to >>> /user/spark/applicationHistory/com.na.scalaspark.sbstreamingtv-1423587980782 >>> 15/02/10 12:06:20 INFO YarnClusterScheduler: Created YarnClusterScheduler >>> 15/02/10 12:06:20 INFO YarnClusterScheduler: >>> YarnClusterScheduler.postStartHook done >>> 15/02/10 12:06:21 INFO MemoryStore: ensureFreeSpace(253715) called with >>> curMem=0, maxMem=1235012812 >>> 15/02/10 12:06:21 INFO MemoryStore: Block broadcast_0 stored as values >>> to memory (estimated size 247.8 KB, free 1177.6 MB) >>> 15/02/10 12:06:21 INFO FileInputFormat: Total input paths to process : 1 >>> 15/02/10 12:06:21 INFO SparkContext: Starting job: collect at >>> sbStreamingTv.scala:198 >>> 15/02/10 12:06:21 INFO DAGScheduler: Got job 0 (collect at >>> sbStreamingTv.scala:198) with 2 output partitions (allowLocal=false) >>> 15/02/10 12:06:21 INFO DAGScheduler: Final stage: Stage 0(*collect at >>> sbStreamingTv.scala:198*) >>> 15/02/10 12:06:21 INFO DAGScheduler: Parents of final stage: List() >>> 15/02/10 12:06:21 INFO DAGScheduler: Missing parents: List() >>> 15/02/10 12:06:21 INFO DAGScheduler: Submitting Stage 0 (*MappedRDD[1] >>> at textFile at sbStreamingTv.scala:198*), which has no missing parents >>> 15/02/10 12:06:21 INFO DAGScheduler: Submitting 2 missing tasks from >>> Stage 0 (*MappedRDD[1] at textFile at sbStreamingTv.scala:198*) >>> 15/02/10 12:06:21 INFO YarnClusterScheduler: Adding task set 0.0 with 2 >>> tasks >>> 15/02/10 12:06:21 INFO AMRMClientImpl: Received new token for : >>> phd40010024.na.com:8041 >>> 15/02/10 12:06:21 INFO AMRMClientImpl: Received new token for : >>> phd40010002.na.com:8041 >>> 15/02/10 12:06:21 INFO AMRMClientImpl: Received new token for : >>> phd40010022.na.com:8041 >>> 15/02/10 12:06:21 INFO RackResolver: Resolved phd40010002.na.com to >>> /sdc/c4h5 >>> 15/02/10 12:06:21 INFO RackResolver: Resolved phd40010022.na.com to >>> /sdc/c4h5 >>> 15/02/10 12:06:21 INFO RackResolver: Resolved phd40010024.na.com to >>> /sdc/c4h1 >>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching container >>> container_1423081782629_7370_01_000003 for on host phd40010002.na.com >>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching >>> ExecutorRunnable. 
driverUrl: akka.tcp:// >>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, >>> executorHostname: phd40010002.na.com >>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching container >>> container_1423081782629_7370_01_000004 for on host phd40010022.na.com >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Starting Executor Container >>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching >>> ExecutorRunnable. driverUrl: akka.tcp:// >>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, >>> executorHostname: phd40010022.na.com >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Starting Executor Container >>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching container >>> container_1423081782629_7370_01_000002 for on host phd40010024.na.com >>> 15/02/10 12:06:21 INFO YarnAllocationHandler: Launching >>> ExecutorRunnable. driverUrl: akka.tcp:// >>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, >>> executorHostname: phd40010024.na.com >>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: >>> yarn.client.max-nodemanagers-proxies : 500 >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Starting Executor Container >>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: >>> yarn.client.max-nodemanagers-proxies : 500 >>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: >>> yarn.client.max-nodemanagers-proxies : 500 >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up >>> ContainerLaunchContext >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up >>> ContainerLaunchContext >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up >>> ContainerLaunchContext >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Preparing Local resources >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Preparing Local resources >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Preparing Local resources >>> 15/02/10 12:06:21 INFO ApplicationMaster: All executors have launched. 
>>> 15/02/10 12:06:21 INFO ApplicationMaster: Started progress reporter >>> thread - sleep time : 5000 >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Prepared Local resources >>> Map(__spark__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: >>> -1 file: >>> "/user/jg/.sparkStaging/application_1423081782629_7370/spark-assembly-1.0.0-cdh5.1.3-hadoop2.3.0-cdh5.1.3.jar" >>> } size: 93542713 timestamp: 1423587960750 type: FILE visibility: PRIVATE, >>> __app__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: -1 >>> file: >>> "/user/jg/.sparkStaging/application_1423081782629_7370/sbStreamingTv-0.0.1-SNAPSHOT-jar-with-dependencies.jar" >>> } size: 95950353 timestamp: 1423587960370 type: FILE visibility: PRIVATE) >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Prepared Local resources >>> Map(__spark__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: >>> -1 file: >>> "/user/jg/.sparkStaging/application_1423081782629_7370/spark-assembly-1.0.0-cdh5.1.3-hadoop2.3.0-cdh5.1.3.jar" >>> } size: 93542713 timestamp: 1423587960750 type: FILE visibility: PRIVATE, >>> __app__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: -1 >>> file: >>> "/user/jg/.sparkStaging/application_1423081782629_7370/sbStreamingTv-0.0.1-SNAPSHOT-jar-with-dependencies.jar" >>> } size: 95950353 timestamp: 1423587960370 type: FILE visibility: PRIVATE) >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Prepared Local resources >>> Map(__spark__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: >>> -1 file: >>> "/user/jg/.sparkStaging/application_1423081782629_7370/spark-assembly-1.0.0-cdh5.1.3-hadoop2.3.0-cdh5.1.3.jar" >>> } size: 93542713 timestamp: 1423587960750 type: FILE visibility: PRIVATE, >>> __app__.jar -> resource { scheme: "hdfs" host: "nameservice1" port: -1 >>> file: >>> "/user/jg/.sparkStaging/application_1423081782629_7370/sbStreamingTv-0.0.1-SNAPSHOT-jar-with-dependencies.jar" >>> } size: 95950353 timestamp: 1423587960370 type: FILE visibility: PRIVATE) >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up executor with >>> commands: List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill >>> %p', -Xms2048m -Xmx2048m , -Djava.io.tmpdir=$PWD/tmp, >>> -Dlog4j.configuration=log4j-spark-container.properties, >>> org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp:// >>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, 1, >>> phd40010002.na.com, 1, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr) >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up executor with >>> commands: List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill >>> %p', -Xms2048m -Xmx2048m , -Djava.io.tmpdir=$PWD/tmp, >>> -Dlog4j.configuration=log4j-spark-container.properties, >>> org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp:// >>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, 3, >>> phd40010024.na.com, 1, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr) >>> 15/02/10 12:06:21 INFO ExecutorRunnable: Setting up executor with >>> commands: List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill >>> %p', -Xms2048m -Xmx2048m , -Djava.io.tmpdir=$PWD/tmp, >>> -Dlog4j.configuration=log4j-spark-container.properties, >>> org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp:// >>> sp...@phd40010008.na.com:58240/user/CoarseGrainedScheduler, 2, >>> phd40010022.na.com, 1, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr) >>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: Opening proxy : >>> phd40010022.na.com:8041 >>> 15/02/10 12:06:21 
INFO ContainerManagementProtocolProxy: Opening proxy : >>> phd40010024.na.com:8041 >>> 15/02/10 12:06:21 INFO ContainerManagementProtocolProxy: Opening proxy : >>> phd40010002.na.com:8041 >>> 15/02/10 12:06:26 INFO CoarseGrainedSchedulerBackend: Registered >>> executor: Actor[akka.tcp:// >>> sparkexecu...@phd40010022.na.com:29369/user/Executor#43651774] with ID 2 >>> 15/02/10 12:06:26 INFO CoarseGrainedSchedulerBackend: Registered >>> executor: Actor[akka.tcp:// >>> sparkexecu...@phd40010024.na.com:12969/user/Executor#1711844295] with >>> ID 3 >>> 15/02/10 12:06:26 INFO BlockManagerInfo: Registering block manager >>> phd40010022.na.com:14119 with 1178.1 MB RAM >>> 15/02/10 12:06:26 INFO BlockManagerInfo: Registering block manager >>> phd40010024.na.com:53284 with 1178.1 MB RAM >>> 15/02/10 12:06:29 INFO CoarseGrainedSchedulerBackend: Registered >>> executor: Actor[akka.tcp:// >>> sparkexecu...@phd40010002.na.com:35547/user/Executor#-1690254909] with >>> ID 1 >>> 15/02/10 12:06:29 INFO BlockManagerInfo: Registering block manager >>> phd40010002.na.com:62754 with 1178.1 MB RAM >>> 15/02/10 12:06:36 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:06:51 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:07:06 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:07:21 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:07:36 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:07:51 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:08:06 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:08:21 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:08:36 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:08:51 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:09:06 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:09:21 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 12:09:36 WARN YarnClusterScheduler: Initial job has not >>> accepted any resources; check your cluster UI to ensure that workers are >>> registered and have sufficient memory >>> 15/02/10 
12:09:51 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>> 15/02/10 12:10:06 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>> 15/02/10 12:10:21 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>> 15/02/10 12:10:36 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>> 15/02/10 12:10:51 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>
>>> On Fri, Feb 6, 2015 at 3:24 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>>> You can call collect() to pull the contents of an RDD into the driver:
>>>>
>>>> val badIPsLines = badIPs.collect()
>>>>
>>>> On Fri, Feb 6, 2015 at 12:19 PM, Jon Gregg <jonrgr...@gmail.com> wrote:
>>>>
>>>>> OK, I tried that, but how do I convert an RDD to a Set that I can then
>>>>> broadcast and cache?
>>>>>
>>>>> val badIPs = sc.textFile("hdfs:///user/jon/" + "badfullIPs.csv")
>>>>> val badIPsLines = badIPs.getLines
>>>>> val badIpSet = badIPsLines.toSet
>>>>> val badIPsBC = sc.broadcast(badIpSet)
>>>>>
>>>>> produces the error "value getLines is not a member of
>>>>> org.apache.spark.rdd.RDD[String]".
>>>>>
>>>>> Leaving it as an RDD and constantly joining will, I think, be too slow
>>>>> for a streaming job.
>>>>>
>>>>> On Thu, Feb 5, 2015 at 8:06 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>>>
>>>>>> Hi Jon,
>>>>>>
>>>>>> You'll need to put the file on HDFS (or whatever distributed
>>>>>> filesystem you're running on) and load it from there.
>>>>>>
>>>>>> -Sandy
>>>>>>
>>>>>> On Thu, Feb 5, 2015 at 3:18 PM, YaoPau <jonrgr...@gmail.com> wrote:
>>>>>>
>>>>>>> I have a file "badFullIPs.csv" of bad IP addresses used for filtering.
>>>>>>> In yarn-client mode, I simply read it off the edge node, transform it,
>>>>>>> and then broadcast it:
>>>>>>>
>>>>>>> val badIPs = fromFile(edgeDir + "badfullIPs.csv")
>>>>>>> val badIPsLines = badIPs.getLines
>>>>>>> val badIpSet = badIPsLines.toSet
>>>>>>> val badIPsBC = sc.broadcast(badIpSet)
>>>>>>> badIPs.close
>>>>>>>
>>>>>>> How can I accomplish this in yarn-cluster mode?
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-broadcast-a-variable-read-from-a-file-in-yarn-cluster-mode-tp21524.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
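Putting the pieces of this thread together, something like the following should work in yarn-cluster mode (a rough, untested sketch; the HDFS path, batch interval, and socket source are placeholders for the real inputs):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf(), Seconds(60))
    val sc = ssc.sparkContext

    // Read the lookup file from HDFS (not the local edge node), pull it to the
    // driver, and broadcast the Set once.
    val badIpSet = sc.textFile("hdfs:///user/jon/badfullIPs.csv").collect().toSet
    val badIPsBC = sc.broadcast(badIpSet)

    // Use the broadcast value inside the streaming transformations.
    val ips = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val clean = ips.filter(ip => !badIPsBC.value.contains(ip))
    clean.print()

    ssc.start()
    ssc.awaitTermination()

Because the file is read through sc.textFile against HDFS rather than off the edge node with fromFile, the same code runs in both yarn-client and yarn-cluster mode.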