Hi, I am developing a project that uses Kafka, Spark Streaming and
Spark MLlib; this is the GitHub project:
<https://github.com/alonsoir/awesome-recommendation-engine/tree/develop>
I am using a VMware cdh-5.7-0 image with 4 cores and 8 GB of RAM, and the
file I want to process is only 16 MB, but I am running into what looks like
a resource problem, because the process outputs this message (I have also
set .set("spark.driver.allowMultipleContexts", "true") in my SparkConf):
16/06/03 11:58:09 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
When I go to the Spark master web UI, I can see this:
Spark Master at spark://192.168.30.137:7077
URL: spark://192.168.30.137:7077
REST URL: spark://192.168.30.137:6066 (cluster mode)
Alive Workers: 0
Cores in use: 0 Total, 0 Used
Memory in use: 0.0 B Total, 0.0 B Used
Applications: 2 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

Workers
Worker Id | Address | State | Cores | Memory
(no workers listed)

Running Applications
Application ID | Name | Cores | Memory per Node | Submitted Time | User | State | Duration
app-20160603115752-0001 | AmazonKafkaConnector | 0 | 1024.0 MB | 2016/06/03 11:57:52 | cloudera | WAITING | 2.0 min
app-20160603115751-0000 | AmazonKafkaConnector | 0 | 1024.0 MB | 2016/06/03 11:57:51 | cloudera | WAITING | 2.0 min
And this is the Spark worker web UI:
Spark Worker at 192.168.30.137:7078
ID: worker-20160603115937-192.168.30.137-7078
Master URL:
Cores: 4 (0 Used)
Memory: 6.7 GB (0.0 B Used)

Running Executors (0)
ExecutorID | Cores | State | Memory | Job Details | Logs
(no executors listed)
It is weird, isn't it? The Master URL field is empty, and there is no
ExecutorID, no cores, and so on.
If I do a ps xa | grep spark, this is the output:
[cloudera@quickstart bin]$ ps xa | grep spark
6330 ? Sl 0:11 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
/usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
-Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m
org.apache.spark.deploy.master.Master
6674 ? Sl 0:12 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
/etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
-Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory
-Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m
org.apache.spark.deploy.history.HistoryServer
8153 pts/1 Sl+ 0:14 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
/home/cloudera/awesome-recommendation-engine/target/pack/lib/*
-Dprog.home=/home/cloudera/awesome-recommendation-engine/target/pack
-Dprog.version=1.0-SNAPSHOT example.spark.AmazonKafkaConnector
192.168.1.35:9092 amazonRatingsTopic
8413 ? Sl 0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
/usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
spark://quickstart.cloudera:7077
8619 pts/3 S+ 0:00 grep spark
The master is set up with four cores and 1 GB, and the worker has no
dedicated cores and is also using only 1 GB; that is weird, isn't it? I have
configured the VMware image with 4 cores (out of eight) and 8 GB (out of 16).
I also notice in the ps output that the worker process was started against
spark://quickstart.cloudera:7077, while my code connects to
spark://192.168.30.137:7077; I do not know whether that hostname/IP
mismatch matters.
This is how my build.sbt looks:
libraryDependencies ++= Seq(
  "org.apache.kafka" % "kafka_2.10" % "0.8.1"
    exclude("javax.jms", "jms")
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri"),
  // not working play module!! check
  //jdbc,
  //anorm,
  //cache,
  // HTTP client
  "net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
  // HTML parser
  "org.jodd" % "jodd-lagarto" % "3.5.2",
  "com.typesafe" % "config" % "1.2.1",
  "com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
  "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
  "org.twitter4j" % "twitter4j-core" % "4.0.2",
  "org.twitter4j" % "twitter4j-stream" % "4.0.2",
  "org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
  "org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0-cdh5.7.0",
  "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.0",
  "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.0",
  "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.0",
  "org.apache.spark" % "spark-mllib_2.10" % "1.6.0-cdh5.7.0",
  "com.google.code.gson" % "gson" % "2.6.2",
  "commons-cli" % "commons-cli" % "1.3.1",
  "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
  // Akka
  "com.typesafe.akka" %% "akka-actor" % akkaVersion,
  "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
  // MongoDB
  "org.reactivemongo" %% "reactivemongo" % "0.10.0"
)
packAutoSettings
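For completeness, the top of the file (not shown above) looks roughly like
this; the exact scalaVersion is from memory, so treat it as a sketch, but
the Cloudera resolver is what lets sbt fetch the 1.6.0-cdh5.7.0 artifacts:

// Sketch of the rest of my build.sbt (scalaVersion is from memory; it must
// match the _2.10 suffix of the Spark artifacts above). akkaVersion is
// defined elsewhere in the file.
name := "awesome-recommendation-engine"

version := "1.0-SNAPSHOT"

scalaVersion := "2.10.5"

resolvers += "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"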
As you can see, I am using the exact versions of the Spark modules that
ship with the pseudo-cluster, and I want to use sbt-pack in order to create
a Unix command. This is how I am declaring the Spark context
programmatically:
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
  //.setMaster("local[4]")
  .setMaster("spark://192.168.30.137:7077")
  .set("spark.cores.max", "2")
...
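The contexts are then created from that conf in the usual way; this is a
sketch, since I have elided a few lines and the batch interval shown is
illustrative:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: how the contexts get built from sparkConf above.
// The Seconds(2) batch interval is illustrative, not my exact value.
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))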
val ratingFile = "hdfs://192.168.30.137:8020/user/cloudera/ratings.csv"
println("Using this ratingFile: " + ratingFile)
// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}
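For reference, each line of ratings.csv has the shape userId,productId,score;
here is a made-up example of what the parser above does (values invented):

// Hypothetical sample line (values invented, format as parsed above):
val sample = "someUser,someProduct,4.0"
val Array(u, p, s) = sample.split(",")
val parsed = AmazonRating(u, p, s.toDouble) // AmazonRating("someUser", "someProduct", 4.0)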
// only keep users that have rated between MinRecommendationsPerUser and
// MaxRecommendationsPerUser products
// THIS IS THE LINE THAT PROVOKES the WARN TaskSchedulerImpl!
val trainingRatings = rawTrainingRatings.groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()
println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out
of ${rawTrainingRatings.count()}")
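By the way, I understand that nothing in that pipeline actually runs until
the count() calls above, which are the first actions; I suspect even a
trivial job would hit the same WARN while no workers are registered:

// I suspect even this minimal job would hang with the same
// "Initial job has not accepted any resources" WARN, since the
// master reports 0 alive workers:
sc.parallelize(1 to 10).count()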
My questions are: do you see anything wrong with the code? Is there
anything terribly wrong that I have to change? And what can I do to get
this up and running with my resources?
What most annoys me is that the above code works perfectly in the
spark-shell of the virtual image, but when I try to run it programmatically
through the Unix command created with sbt-pack, it does not work.
If the dedicated resources are too few for this project, what else can I
do? I mean, do I need to rent a small cluster on AWS or another provider?
If so, what are your recommendations?
Thank you very much for reading this far.
Regards,
Alonso