Hi,

Spark works in local, standalone and yarn-client modes. Start with master = local; that is the simplest mode. You do NOT need to start $SPARK_HOME/sbin/start-master.sh and $SPARK_HOME/sbin/start-slaves.sh.
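For illustration only, a minimal sketch of what local mode looks like in code (the app name is taken from your code below; local[2] is just an example, and building the SparkContext explicitly here is an assumption about how your driver is structured):

import org.apache.spark.{SparkConf, SparkContext}

// Local mode: driver, scheduler and executors all run in a single JVM,
// so no standalone master or worker process has to be running.
val sparkConf = new SparkConf()
  .setAppName("AmazonKafkaConnector")
  .setMaster("local[2]")   // or local[*] to use every available core
val sc = new SparkContext(sparkConf)

Equally, you can leave setMaster() out of the code and pass --master local to spark-submit instead, as in the command below.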
Also, you do not need to pass all of that to spark-submit. In the Scala code you can do:

val sparkConf = new SparkConf()
  .setAppName("CEP_streaming_with_JDBC")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("spark.hadoop.validateOutputSpecs", "false")

and specify the rest in spark-submit itself with minimum resources:

${SPARK_HOME}/bin/spark-submit \
    --packages com.databricks:spark-csv_2.11:1.3.0 \
    --driver-memory 2G \
    --num-executors 1 \
    --executor-memory 2G \
    --master local \
    --executor-cores 2 \
    --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
    --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
    --class ${FILE_NAME} \
    --conf "spark.ui.port=4040" \
    ${JAR_FILE}

The Spark GUI port is 4040 (the default); just track the progress of the job there. You can specify your own port by replacing 4040 with any unused port value.

Try it anyway. HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 3 June 2016 at 11:39, Alonso <alons...@gmail.com> wrote:

> Hi, I am developing a project that needs to use Kafka, Spark Streaming and
> Spark MLlib; this is the GitHub project
> <https://github.com/alonsoir/awesome-recommendation-engine/tree/develop>.
>
> I am using a VMware cdh-5.7.0 image with 4 cores and 8 GB of RAM, and the
> file that I want to use is only 16 MB. I am finding problems related to
> resources, because the process outputs this message:
>
> .set("spark.driver.allowMultipleContexts", "true")
>
> 16/06/03 11:58:09 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources
>
> When I go to the spark-master page, I can see this:
>
> Spark Master at spark://192.168.30.137:7077
>
> URL: spark://192.168.30.137:7077
> REST URL: spark://192.168.30.137:6066 (cluster mode)
> Alive Workers: 0
> Cores in use: 0 Total, 0 Used
> Memory in use: 0.0 B Total, 0.0 B Used
> Applications: 2 Running, 0 Completed
> Drivers: 0 Running, 0 Completed
> Status: ALIVE
>
> Workers
> Worker Id   Address   State   Cores   Memory
>
> Running Applications
> Application ID            Name                  Cores  Memory per Node  Submitted Time       User      State    Duration
> app-20160603115752-0001   AmazonKafkaConnector  0      1024.0 MB        2016/06/03 11:57:52  cloudera  WAITING  2.0 min
> app-20160603115751-0000   AmazonKafkaConnector  0      1024.0 MB        2016/06/03 11:57:51  cloudera  WAITING  2.0 min
>
> And this is the spark-worker output:
>
> Spark Worker at 192.168.30.137:7078
>
> ID: worker-20160603115937-192.168.30.137-7078
> Master URL:
> Cores: 4 (0 Used)
> Memory: 6.7 GB (0.0 B Used)
>
> Back to Master
> Running Executors (0)
> ExecutorID   Cores   State   Memory   Job Details   Logs
>
> It is weird, isn't it? The master URL is not set and there is no
> ExecutorID, no cores, and so on...
>
> If I do a ps xa | grep spark, this is the output:
>
> [cloudera@quickstart bin]$ ps xa | grep spark
> 6330 ?        Sl     0:11 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
>
> 6674 ?        Sl     0:12 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.history.HistoryServer
>
> 8153 pts/1    Sl+    0:14 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /home/cloudera/awesome-recommendation-engine/target/pack/lib/* -Dprog.home=/home/cloudera/awesome-recommendation-engine/target/pack -Dprog.version=1.0-SNAPSHOT example.spark.AmazonKafkaConnector 192.168.1.35:9092 amazonRatingsTopic
>
> 8413 ?        Sl     0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077
>
> 8619 pts/3    S+     0:00 grep spark
>
> The master is set up with four cores and 1 GB, and the worker has no
> dedicated cores and is also using 1 GB; that is weird, isn't it? I have
> configured the VMware image with 4 cores (out of eight) and 8 GB (out of 16).
>
> This is what my build.sbt looks like:
>
> libraryDependencies ++= Seq(
>   "org.apache.kafka" % "kafka_2.10" % "0.8.1"
>     exclude("javax.jms", "jms")
>     exclude("com.sun.jdmk", "jmxtools")
>     exclude("com.sun.jmx", "jmxri"),
>   //not working play module!! check
>   //jdbc,
>   //anorm,
>   //cache,
>   // HTTP client
>   "net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
>   // HTML parser
>   "org.jodd" % "jodd-lagarto" % "3.5.2",
>   "com.typesafe" % "config" % "1.2.1",
>   "com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
>   "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
>   "org.twitter4j" % "twitter4j-core" % "4.0.2",
>   "org.twitter4j" % "twitter4j-stream" % "4.0.2",
>   "org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
>   "org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
>   "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0-cdh5.7.0",
>   "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.0",
>   "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.0",
>   "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.0",
>   "org.apache.spark" % "spark-mllib_2.10" % "1.6.0-cdh5.7.0",
>   "com.google.code.gson" % "gson" % "2.6.2",
>   "commons-cli" % "commons-cli" % "1.3.1",
>   "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
>   // Akka
>   "com.typesafe.akka" %% "akka-actor" % akkaVersion,
>   "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
>   // MongoDB
>   "org.reactivemongo" %% "reactivemongo" % "0.10.0"
> )
>
> packAutoSettings
>
> As you can see, I am using the exact versions of the Spark modules for the
> pseudo-cluster, and I want to use sbt-pack in order to create a unix
> command. This is how I am declaring the Spark context programmatically:
>
> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>   //.setMaster("local[4]")
>   .setMaster("spark://192.168.30.137:7077")
>   .set("spark.cores.max", "2")
>
> ...
>
> val ratingFile = "hdfs://192.168.30.137:8020/user/cloudera/ratings.csv"
>
> println("Using this ratingFile: " + ratingFile)
>
> // first create an RDD out of the rating file
> val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
>   val Array(userId, productId, scoreStr) = line.split(",")
>   AmazonRating(userId, productId, scoreStr.toDouble)
> }
>
> // only keep users that have rated between MinRecommendationsPerUser and
> // MaxRecommendationsPerUser products
>
> // THIS IS THE LINE THAT PROVOKES THE WARN TaskSchedulerImpl!
> val trainingRatings = rawTrainingRatings.groupBy(_.userId)
>   .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
>   .flatMap(_._2)
>   .repartition(NumPartitions)
>   .cache()
>
> println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")
>
> My question is: do you see anything wrong with the code? Is there anything
> terribly wrong that I have to change? And what can I do to have this up and
> running with my resources?
>
> What most annoys me is that the above code works perfectly in the Spark
> console of the virtual image, but when I try to run it programmatically,
> through the unix command created with sbt-pack, it does not work.
>
> If the dedicated resources are too few to develop this project, what else
> can I do? I mean, do I need to hire a tiny cluster with AWS or another
> provider? If that is the right answer, what are your recommendations?
>
> Thank you very much for reading until here.
>
> Regards,
>
> Alonso
>
> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>
> ------------------------------
> View this message in context: About a problem running a spark job in a
> cdh-5.7.0 vmware image
> <http://apache-spark-user-list.1001560.n3.nabble.com/About-a-problem-running-a-spark-job-in-a-cdh-5-7-0-vmware-image-tp27082.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.