Re: CDH 5.0 and Spark 0.9.0
Hello Sean,

Thanks a bunch. I am not currently running in HA mode. The configuration is
identical to our CDH4 setup, which works perfectly fine. It's really strange
that only Spark breaks with this enabled.

On Thu, May 1, 2014 at 3:06 AM, Sean Owen wrote:
> This codec does require native libraries to be installed, IIRC, but
> they are installed with CDH 5.
>
> The error you show does not look related, though. Are you sure your HA
> setup is working and that you have configured it correctly in whatever
> config Spark is seeing?
> --
> Sean Owen | Director, Data Science | London
>
>
> On Thu, May 1, 2014 at 12:44 AM, Paul Schooss wrote:
> > Hello,
> >
> > So I was unable to run the following commands from the spark shell with
> > CDH 5.0 and Spark 0.9.0; see below.
> >
> > Once I removed the property
> >
> >   <property>
> >     <name>io.compression.codec.lzo.class</name>
> >     <value>com.hadoop.compression.lzo.LzoCodec</value>
> >     <final>true</final>
> >   </property>
> >
> > from the core-site.xml on the node, the Spark commands worked. Is there a
> > specific setup I am missing?
> >
> > scala> var log = sc.textFile("hdfs://jobs-ab-hnn1//input/core-site.xml")
> > 14/04/30 22:43:16 INFO MemoryStore: ensureFreeSpace(78800) called with
> > curMem=150115, maxMem=308713881
> > 14/04/30 22:43:16 INFO MemoryStore: Block broadcast_1 stored as values to
> > memory (estimated size 77.0 KB, free 294.2 MB)
> > 14/04/30 22:43:16 WARN Configuration: mapred-default.xml:an attempt to
> > override final parameter: mapreduce.tasktracker.cache.local.size; Ignoring.
> > 14/04/30 22:43:16 WARN Configuration: yarn-site.xml:an attempt to override
> > final parameter: mapreduce.output.fileoutputformat.compress.type; Ignoring.
> > 14/04/30 22:43:16 WARN Configuration: hdfs-site.xml:an attempt to override
> > final parameter: mapreduce.map.output.compress.codec; Ignoring.
> > log: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at
> > <console>:12
> >
> > scala> log.count()
> > 14/04/30 22:43:03 WARN JobConf: The variable mapred.child.ulimit is no
> > longer used.
> > 14/04/30 22:43:04 WARN Configuration: mapred-default.xml:an attempt to
> > override final parameter: mapreduce.tasktracker.cache.local.size; Ignoring.
> > 14/04/30 22:43:04 WARN Configuration: yarn-site.xml:an attempt to override
> > final parameter: mapreduce.output.fileoutputformat.compress.type; Ignoring.
> > 14/04/30 22:43:04 WARN Configuration: hdfs-site.xml:an attempt to override
> > final parameter: mapreduce.map.output.compress.codec; Ignoring.
> > java.lang.IllegalArgumentException: java.net.UnknownHostException: jobs-a-hnn1
> >     at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> >     at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:237)
> >     at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:141)
> >     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:576)
> >     at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:521)
> >     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:146)
> >     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
> >     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> >     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> >     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> >     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> >     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
> >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
> >     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
> >     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
> >     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> >     at scala.Option.getOrElse(Option.scala:120)
> >     at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
> >     at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> >     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
> >     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> >     at scala.Option.getOrElse(Option.scala:120)
> >     at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
> >     at org.apache.spark.SparkContext.runJob(SparkContext.scala:902)
> >     at org.apache.spark.rdd.RDD.count(RDD.scala:720)
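[Editor's note] The fix in this thread was removing a `final` property from core-site.xml — it is the `<final>true</final>` marking that makes the "attempt to override final parameter … Ignoring" warnings appear. As a minimal illustrative sketch (not part of any Hadoop tooling), a named property and its final flag can be pulled out of a Hadoop `*-site.xml` document like this:

```python
import xml.etree.ElementTree as ET

def get_hadoop_property(xml_text, name):
    """Return (value, is_final) for a named property in a Hadoop *-site.xml
    document, or None if the property is absent. A final=true property is the
    kind that later configs cannot override."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value = prop.findtext("value")
            is_final = prop.findtext("final") == "true"
            return value, is_final
    return None

# The property quoted in the email above.
core_site = """<configuration>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    <final>true</final>
  </property>
</configuration>"""

print(get_hadoop_property(core_site, "io.compression.codec.lzo.class"))
```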
CDH 5.0 and Spark 0.9.0
Hello,

So I was unable to run the following commands from the spark shell with CDH
5.0 and Spark 0.9.0; see below.

Once I removed the property

  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    <final>true</final>
  </property>

from the core-site.xml on the node, the Spark commands worked. Is there a
specific setup I am missing?

scala> var log = sc.textFile("hdfs://jobs-ab-hnn1//input/core-site.xml")
14/04/30 22:43:16 INFO MemoryStore: ensureFreeSpace(78800) called with curMem=150115, maxMem=308713881
14/04/30 22:43:16 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 77.0 KB, free 294.2 MB)
14/04/30 22:43:16 WARN Configuration: mapred-default.xml:an attempt to override final parameter: mapreduce.tasktracker.cache.local.size; Ignoring.
14/04/30 22:43:16 WARN Configuration: yarn-site.xml:an attempt to override final parameter: mapreduce.output.fileoutputformat.compress.type; Ignoring.
14/04/30 22:43:16 WARN Configuration: hdfs-site.xml:an attempt to override final parameter: mapreduce.map.output.compress.codec; Ignoring.
log: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12

scala> log.count()
14/04/30 22:43:03 WARN JobConf: The variable mapred.child.ulimit is no longer used.
14/04/30 22:43:04 WARN Configuration: mapred-default.xml:an attempt to override final parameter: mapreduce.tasktracker.cache.local.size; Ignoring.
14/04/30 22:43:04 WARN Configuration: yarn-site.xml:an attempt to override final parameter: mapreduce.output.fileoutputformat.compress.type; Ignoring.
14/04/30 22:43:04 WARN Configuration: hdfs-site.xml:an attempt to override final parameter: mapreduce.map.output.compress.codec; Ignoring.
java.lang.IllegalArgumentException: java.net.UnknownHostException: jobs-a-hnn1
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:237)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:141)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:576)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:521)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:146)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:902)
    at org.apache.spark.rdd.RDD.count(RDD.scala:720)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
    at $iwC$$iwC$$iwC.<init>(<console>:20)
    at $iwC$$iwC.<init>(<console>:22)
    at $iwC.<init>(<console>:24)
    at <init>(<console>:26)
    at .<init>(<console>:30)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:795)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:840)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:752)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:600)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:607)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:610)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:935)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
    at scala.tools.nsc.util.ScalaC
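[Editor's note] The `UnknownHostException: jobs-a-hnn1` means the HDFS client could not resolve that name — note it differs from the `jobs-ab-hnn1` used in the `textFile` URI, which is consistent with Sean's suggestion that the HA/nameservice config Spark sees is wrong (an HA nameservice ID will also fail this way unless hdfs-site.xml defines it). A quick, hedged sketch for checking plain DNS/hosts resolution from Python; the failing hostname here is a placeholder:

```python
import socket

def check_resolvable(hostnames):
    """Map each hostname to True if it resolves via DNS or /etc/hosts,
    False otherwise. A JVM UnknownHostException usually means the same
    lookup fails at this level, or the name is an HA nameservice ID that
    only hdfs-site.xml (dfs.nameservices et al.) can resolve."""
    results = {}
    for host in hostnames:
        try:
            socket.gethostbyname(host)
            results[host] = True
        except socket.gaierror:
            results[host] = False
    return results

# "localhost" should resolve anywhere; the reserved .invalid TLD should not.
print(check_resolvable(["localhost", "no-such-host.invalid"]))
```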
Re: JMX with Spark
Hello Folks,

Sorry for the delay; these emails got missed due to the volume. Here is my
metrics.conf:

root@jobs-ab-hdn4:~# cat /opt/klout/spark/conf/metrics.conf

# syntax: [instance].sink|source.[name].[options]=[value]

# This file configures Spark's internal metrics system. The metrics system is
# divided into instances which correspond to internal components.
# Each instance can be configured to report its metrics to one or more sinks.
# Accepted values for [instance] are "master", "worker", "executor", "driver",
# and "applications". A wild card "*" can be used as an instance name, in
# which case all instances will inherit the supplied property.
#
# Within an instance, a "source" specifies a particular set of grouped metrics.
# There are two kinds of sources:
#   1. Spark internal sources, like MasterSource, WorkerSource, etc, which will
#      collect a Spark component's internal state. Each instance is paired with a
#      Spark source that is added automatically.
#   2. Common sources, like JvmSource, which will collect low level state.
#      These can be added through configuration options and are then loaded
#      using reflection.
#
# A "sink" specifies where metrics are delivered to. Each instance can be
# assigned one or more sinks.
#
# The sink|source field specifies whether the property relates to a sink or
# source.
#
# The [name] field specifies the name of source or sink.
#
# The [options] field is the specific property of this source or sink. The
# source or sink is responsible for parsing this property.
#
# Notes:
#   1. To add a new sink, set the "class" option to a fully qualified class
#      name (see examples below).
#   2. Some sinks involve a polling period. The minimum allowed polling period
#      is 1 second.
#   3. Wild card properties can be overridden by more specific properties.
#      For example, master.sink.console.period takes precedence over
#      *.sink.console.period.
#   4. A metrics specific configuration
#      "spark.metrics.conf=${SPARK_HOME}/conf/metrics.properties" should be
#      added to Java properties using -Dspark.metrics.conf=xxx if you want to
#      customize the metrics system. You can also put the file in
#      ${SPARK_HOME}/conf and it will be loaded automatically.
#   5. MetricsServlet is added by default as a sink in master, worker and
#      client driver; you can send an http request to "/metrics/json" to get a
#      snapshot of all the registered metrics in json format. For the master,
#      requests to "/metrics/master/json" and "/metrics/applications/json" can
#      be sent separately to get metrics snapshots of the master and
#      applications instances. MetricsServlet may not be configured by itself.

## List of available sinks and their properties.

# org.apache.spark.metrics.sink.ConsoleSink
#   Name:     Default:   Description:
#   period    10         Poll period
#   unit      seconds    Units of poll period

# org.apache.spark.metrics.sink.CSVSink
#   Name:      Default:   Description:
#   period     10         Poll period
#   unit       seconds    Units of poll period
#   directory  /tmp       Where to store CSV files

# org.apache.spark.metrics.sink.GangliaSink
#   Name:    Default:    Description:
#   host     NONE        Hostname or multicast group of Ganglia server
#   port     NONE        Port of Ganglia server(s)
#   period   10          Poll period
#   unit     seconds     Units of poll period
#   ttl      1           TTL of messages sent by Ganglia
#   mode     multicast   Ganglia network mode ('unicast' or 'multicast')

# org.apache.spark.metrics.sink.JmxSink

# org.apache.spark.metrics.sink.MetricsServlet
#   Name:    Default:   Description:
#   path     VARIES*    Path prefix from the web server root
#   sample   false      Whether to show entire set of samples for histograms
#                       ('false' or 'true')
#
#   * Default path is /metrics/json for all instances except the master. The
#     master has two paths:
#       /metrics/applications/json  # App information
#       /metrics/master/json        # Master information

# org.apache.spark.metrics.sink.GraphiteSink
#   Name:    Default:      Description:
#   host     NONE          Hostname of Graphite server
#   port     NONE          Port of Graphite server
#   period   10            Poll period
#   unit     seconds       Units of poll period
#   prefix   EMPTY STRING  Prefix to prepend to metric name

## Examples
# Enable JmxSink for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# Enable ConsoleSink for all instances by class name
#*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink

# Polling period for ConsoleSink
#*.sink.console.period=10
#*.sink.console.unit=seconds

# Master instance overlap polling period
#master.sink.console.period=15
#master.sink.console.unit=seconds

# Enable CsvSink for all instances
#*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink

# Polling period for CsvSink
#*.sink.csv.period=1
#*.sink.csv.unit=minutes

# Polling directory for CsvSink
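[Editor's note] Note 3 in the file above — specific properties take precedence over wildcard ones — can be illustrated with a toy resolver. This is a sketch of the precedence rule only, not Spark's actual metrics-config implementation:

```python
def resolve_metric_property(props, instance, key):
    """Look up a sink/source property for an instance, letting a specific
    'instance.key' entry take precedence over a wildcard '*.key' entry,
    as note 3 of the metrics.conf comments describes."""
    specific = "%s.%s" % (instance, key)
    wildcard = "*.%s" % key
    if specific in props:
        return props[specific]
    return props.get(wildcard)

props = {
    "*.sink.console.period": "10",
    "master.sink.console.period": "15",
}

# The master has its own override; the worker falls back to the wildcard.
print(resolve_metric_property(props, "master", "sink.console.period"))  # 15
print(resolve_metric_property(props, "worker", "sink.console.period"))  # 10
```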
JMX with Spark
Has anyone got this working? I have enabled the properties for it in the
metrics.conf file and ensured that the file is placed under Spark's home
directory. Any ideas why I don't see Spark beans?
Re: Can't run a simple spark application with 0.9.1
I am a dork, please disregard this issue. I did not have the slaves correctly
configured. This error is very misleading.

On Tue, Apr 15, 2014 at 11:21 AM, Paul Schooss wrote:
> Hello,
>
> Currently I deploy Spark 0.9.1 using a new way of starting it up:
>
> exec start-stop-daemon --start --pidfile /var/run/spark.pid \
>   --make-pidfile --chuid ${SPARK_USER}:${SPARK_GROUP} --chdir ${SPARK_HOME} \
>   --exec /usr/bin/java -- -cp ${CLASSPATH} \
>   -Dcom.sun.management.jmxremote.authenticate=false \
>   -Dcom.sun.management.jmxremote.ssl=false \
>   -Dcom.sun.management.jmxremote.port=10111 \
>   -Dspark.akka.logLifecycleEvents=true -Djava.library.path= \
>   -XX:ReservedCodeCacheSize=512M -XX:+UseCodeCacheFlushing \
>   -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC \
>   -Dspark.executor.memory=10G -Xmx10g ${MAIN_CLASS} ${MAIN_CLASS_ARGS}
>
> where the classpath points to the Spark jar that we compile with sbt. When I
> try to run a job I receive the following warning:
>
> WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
> your cluster UI to ensure that workers are registered and have sufficient
> memory
>
> My first question is: do I need the entire Spark project on disk in order
> to run jobs? Or what else am I doing wrong?
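[Editor's note] Since the root cause here was a misconfigured slaves file, it can help to sanity-check conf/slaves before starting the cluster. A small illustrative Python sketch, assuming the usual one-hostname-per-line format with '#' comments and blank lines ignored (the hostnames below are hypothetical):

```python
def parse_slaves(text):
    """Parse the contents of a conf/slaves file: one hostname per line,
    with blank lines and '#' comments ignored -- the convention the
    standard cluster launch scripts follow."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line)
    return hosts

example = """
# workers for the jobs cluster (hypothetical hostnames)
jobs-ab-hdn1
jobs-ab-hdn2

jobs-ab-hdn3  # rack 2
"""

print(parse_slaves(example))
```

Feeding each parsed hostname through a resolution check before launch would catch the kind of silent misconfiguration that produced the misleading "Initial job has not accepted any resources" warning.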
Can't run a simple spark application with 0.9.1
Hello,

Currently I deploy Spark 0.9.1 using a new way of starting it up:

exec start-stop-daemon --start --pidfile /var/run/spark.pid \
  --make-pidfile --chuid ${SPARK_USER}:${SPARK_GROUP} --chdir ${SPARK_HOME} \
  --exec /usr/bin/java -- -cp ${CLASSPATH} \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=10111 \
  -Dspark.akka.logLifecycleEvents=true -Djava.library.path= \
  -XX:ReservedCodeCacheSize=512M -XX:+UseCodeCacheFlushing \
  -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC \
  -Dspark.executor.memory=10G -Xmx10g ${MAIN_CLASS} ${MAIN_CLASS_ARGS}

where the classpath points to the Spark jar that we compile with sbt. When I
try to run a job I receive the following warning:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
memory

My first question is: do I need the entire Spark project on disk in order to
run jobs? Or what else am I doing wrong?
Re: Spark 0.9.1 - How to run bin/spark-class with my own hadoop jar files?
Andrew,

I ran into the same problem and eventually settled on just running the jars
directly with java. Since we use sbt to build our jars, we had all the
dependencies built into the jar itself, so no need for random classpaths.

On Tue, Mar 25, 2014 at 1:47 PM, Andrew Lee wrote:
> Hi All,
>
> I'm getting the following error when I execute start-master.sh, which also
> invokes spark-class at the end.
>
> Failed to find Spark assembly in /root/spark/assembly/target/scala-2.10/
> You need to build Spark with 'sbt/sbt assembly' before running this program.
>
> After digging into the code, I see the CLASSPATH is hardcoded with
> "spark-assembly.*hadoop.*.jar".
>
> In bin/spark-class:
>
> if [ ! -f "$FWDIR/RELEASE" ]; then
>   # Exit if the user hasn't compiled Spark
>   num_jars=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar" | wc -l)
>   jars_list=$(ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/ | grep "spark-assembly.*hadoop.*.jar")
>   if [ "$num_jars" -eq "0" ]; then
>     echo "Failed to find Spark assembly in $FWDIR/assembly/target/scala-$SCALA_VERSION/" >&2
>     echo "You need to build Spark with 'sbt/sbt assembly' before running this program." >&2
>     exit 1
>   fi
>   if [ "$num_jars" -gt "1" ]; then
>     echo "Found multiple Spark assembly jars in $FWDIR/assembly/target/scala-$SCALA_VERSION:" >&2
>     echo "$jars_list"
>     echo "Please remove all but one jar."
>     exit 1
>   fi
> fi
>
> Is there any reason why this is only grabbing spark-assembly.*hadoop.*.jar?
> I am trying to run Spark that links to my own version of Hadoop under
> /opt/hadoop23/, and I use 'sbt/sbt clean package' to build the package
> without the Hadoop jar. What is the correct way to link to my own Hadoop
> jar?
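[Editor's note] The shell check Andrew quotes can be mirrored in a few lines for experimentation. This is a hedged Python sketch of the same "exactly one assembly jar" logic, not Spark's code — the glob pattern only approximates the grep regex, and the demo directory and jar name are fabricated:

```python
import fnmatch
import os
import tempfile

def find_spark_assembly(assembly_dir):
    """Mimic bin/spark-class's check: exactly one spark-assembly*hadoop*.jar
    must be present in the directory, otherwise raise an error with the same
    kind of message the script prints."""
    try:
        names = os.listdir(assembly_dir)
    except OSError:
        names = []
    jars = [n for n in names if fnmatch.fnmatch(n, "spark-assembly*hadoop*.jar")]
    if len(jars) == 0:
        raise RuntimeError("Failed to find Spark assembly in %s" % assembly_dir)
    if len(jars) > 1:
        raise RuntimeError("Found multiple Spark assembly jars: %s" % ", ".join(jars))
    return os.path.join(assembly_dir, jars[0])

# Demo against a throwaway directory holding one fake assembly jar.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "spark-assembly-0.9.1-hadoop2.2.0.jar"), "w").close()
print(find_spark_assembly(demo_dir))
```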
NoClassFound Errors with Streaming Twitter
Hello Folks,

We have a strange issue going on with a Spark standalone cluster in which a
simple test application is having a hard time using external classes. Here
are the details.

The application is located here: https://github.com/prantik/spark-example

We use classes such as Spark's streaming twitter and twitter4j to stream the
twitter hose. We use sbt to build the jar that we execute against the
cluster. We have verified that the jar contains these classes:

jar tf /opt/SimpleProject-assembly-1.1.jar | grep twitter4j/Status
twitter4j/Status.class

jar tf /opt/SimpleProject-assembly-1.1.jar | grep twitter/TwitterRec
org/apache/spark/streaming/twitter/TwitterReceiver$$anon$1.class
org/apache/spark/streaming/twitter/TwitterReceiver$$anonfun$onStart$1.class
org/apache/spark/streaming/twitter/TwitterReceiver$$anonfun$onStop$1.class
org/apache/spark/streaming/twitter/TwitterReceiver.class

Also, we ensure that even the slaves have classpaths set to reference these
libraries. ps auxww | grep spark on a slave shows:

/usr/bin/java -cp /opt/spark/external/twitter/target/spark-streaming-twitter_2.10-0.9.0-incubating.jar:/opt/spark/tools/target/spark-tools_2.10-0.9.0-incubating.jar:/opt/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar

However, we encounter the following error when running the application on
the slaves:

14/03/13 21:20:42 INFO Executor: Running task ID 74
14/03/13 21:20:42 ERROR Executor: Exception in task ID 74
java.lang.ClassNotFoundException: twitter4j.Status

We don't know how to address the class not being found. Any ideas?

Regards,

Paul
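[Editor's note] Since a jar is just a zip archive, the `jar tf … | grep` verification above can also be scripted. A minimal illustrative Python sketch — the in-memory "jar" and its entry bytes are fabricated purely for the demo; note that a class being present in the jar does not guarantee the executor's classloader sees that jar, which is the usual cause of this kind of ClassNotFoundException:

```python
import io
import zipfile

def jar_contains_class(jar_bytes, class_name):
    """Check whether a jar (a zip archive) contains the .class entry for a
    fully qualified class name, e.g. 'twitter4j.Status' maps to the entry
    'twitter4j/Status.class'. Equivalent to `jar tf foo.jar | grep ...`."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return entry in jar.namelist()

# Build a tiny in-memory "jar" with one dummy entry to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("twitter4j/Status.class", b"\xca\xfe\xba\xbe")

print(jar_contains_class(buf.getvalue(), "twitter4j.Status"))
print(jar_contains_class(buf.getvalue(), "twitter4j.StatusListener"))
```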
Re: Applications for Spark on HDFS
Thanks Sandy,

I have not taken advantage of that yet but will research how to invoke that
option when submitting the application to the Spark master. Currently I am
running a standalone Spark master and using the run-class script to invoke
the application we crafted as a test.

On Tue, Mar 11, 2014 at 5:09 PM, Sandy Ryza wrote:
> Hi Paul,
>
> What do you mean by distributing the jars manually? If you register jars
> that are local to the client with SparkContext.addJars, Spark should handle
> distributing them to the workers. Are you taking advantage of this?
>
> -Sandy
>
> On Tue, Mar 11, 2014 at 3:09 PM, Paul Schooss wrote:
>> Hello Folks,
>>
>> I was wondering if anyone had experience placing application jars for
>> Spark onto HDFS. Currently I am distributing the jars manually and would
>> love to source the jar via HDFS a la distributed caching with MR. Any
>> ideas?
>>
>> Regards,
>>
>> Paul
Applications for Spark on HDFS
Hello Folks,

I was wondering if anyone had experience placing application jars for Spark
onto HDFS. Currently I am distributing the jars manually and would love to
source the jar via HDFS a la distributed caching with MR. Any ideas?

Regards,

Paul