Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

Alonso Isidoro Roman Mon, 06 Jun 2016 08:50:49 -0700

Hi, just to update the thread, i have just submited a simple wordcount job
using yarn using this command:


[cloudera@quickstart simple-word-count]$ spark-submit --class
com.example.Hello --master yarn --deploy-mode cluster --driver-memory
1024Mb --executor-memory 1G --executor-cores 1
target/scala-2.10/test_2.10-1.0.jar

and the process was submited to the cluster and finalized fine, i can see
the correct output. Now i know that the previous process havent enough
resources. Now it is a matter of tuning the process...

Running free command outputs this:


[cloudera@quickstart simple-word-count]$ free
             total       used       free     shared    buffers     cached
Mem:       8061104    6687044    1374060       3464       5796     484416
-/+ buffers/cache:    6196832    1864272
Swap:      8388604     687500    7701104

so, i can only use at least 1GB...


Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>

2016-06-06 12:03 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> have you tried master local that should work. This works as a test
>
> ${SPARK_HOME}/bin/spark-submit \
>                  --driver-memory 2G \
>                 --num-executors 1 \
>                 --executor-memory 2G \
>                 --master local[2] \
>                 --executor-cores 2 \
>                 --conf "spark.scheduler.mode=FAIR" \
>                 --conf
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps" \
>                 --jars
> /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>                 --class
> "com.databricks.apps.twitter_classifier.${FILE_NAME}" \
>                 --conf "spark.ui.port=${SP}" \
>                 --conf "spark.kryoserializer.buffer.max=512" \
>                 ${JAR_FILE} \
>                 ${OUTPUT_DIRECTORY:-/tmp/tweets} \
>                 ${NUM_TWEETS_TO_COLLECT:-10000} \
>                 ${OUTPUT_FILE_INTERVAL_IN_SECS:-10} \
>                 ${OUTPUT_FILE_PARTITIONS_EACH_INTERVAL:-1} \
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 6 June 2016 at 10:28, Alonso Isidoro Roman <alons...@gmail.com> wrote:
>
>> Hi guys, i finally understand that i cannot use sbt-pack to use
>> programmatically  the spark-streaming job as unix commands, i have to use
>> yarn or mesos  in order to run the jobs.
>>
>> I have some doubts, if i run the spark streaming jogs as yarn client
>> mode, i am receiving this exception:
>>
>> [cloudera@quickstart ~]$ spark-submit --class
>> example.spark.AmazonKafkaConnectorWithMongo --master yarn --deploy-mode
>> client --driver-memory 4g --executor-memory 2g --executor-cores 3
>> /home/cloudera/awesome-recommendation-engine/target/scala-2.10/my-recommendation-spark-engine_2.10-1.0-SNAPSHOT.jar
>> 192.168.1.35:9092 amazonRatingsTopic
>> java.lang.ClassNotFoundException:
>> example.spark.AmazonKafkaConnectorWithMongo
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:270)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:175)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>> But, if i use cluster mode, i have that is job is accepted.
>>
>> [cloudera@quickstart ~]$ spark-submit --class
>> example.spark.AmazonKafkaConnectorWithMongo --master yarn --deploy-mode
>> cluster --driver-memory 4g --executor-memory 2g --executor-cores 2
>> /home/cloudera/awesome-recommendation-engine/target/scala-2.10/my-recommendation-spark-engine_2.10-1.0-SNAPSHOT.jar
>> 192.168.1.35:9092 amazonRatingsTopic
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 16/06/06 11:16:46 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> 16/06/06 11:16:46 INFO client.RMProxy: Connecting to ResourceManager at /
>> 0.0.0.0:8032
>> 16/06/06 11:16:46 INFO yarn.Client: Requesting a new application from
>> cluster with 1 NodeManagers
>> 16/06/06 11:16:46 INFO yarn.Client: Verifying our application has not
>> requested more than the maximum memory capability of the cluster (8192 MB
>> per container)
>> 16/06/06 11:16:46 INFO yarn.Client: Will allocate AM container, with 4505
>> MB memory including 409 MB overhead
>> 16/06/06 11:16:46 INFO yarn.Client: Setting up container launch context
>> for our AM
>> 16/06/06 11:16:46 INFO yarn.Client: Setting up the launch environment for
>> our AM container
>> 16/06/06 11:16:46 INFO yarn.Client: Preparing resources for our AM
>> container
>> 16/06/06 11:16:47 WARN shortcircuit.DomainSocketFactory: The
>> short-circuit local reads feature cannot be used because libhadoop cannot
>> be loaded.
>> 16/06/06 11:16:47 INFO yarn.Client: Uploading resource
>> file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar
>> ->
>> hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1465201086091_0006/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar
>> 16/06/06 11:16:47 INFO yarn.Client: Uploading resource
>> file:/home/cloudera/awesome-recommendation-engine/target/scala-2.10/my-recommendation-spark-engine_2.10-1.0-SNAPSHOT.jar
>> ->
>> hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1465201086091_0006/my-recommendation-spark-engine_2.10-1.0-SNAPSHOT.jar
>> 16/06/06 11:16:47 INFO yarn.Client: Uploading resource
>> file:/tmp/spark-8e5fe800-bed2-4173-bb11-d47b3ab3b621/__spark_conf__5840282197389631291.zip
>> ->
>> hdfs://quickstart.cloudera:8020/user/cloudera/.sparkStaging/application_1465201086091_0006/__spark_conf__5840282197389631291.zip
>> 16/06/06 11:16:47 INFO spark.SecurityManager: Changing view acls to:
>> cloudera
>> 16/06/06 11:16:47 INFO spark.SecurityManager: Changing modify acls to:
>> cloudera
>> 16/06/06 11:16:47 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set(cloudera); users with modify permissions: Set(cloudera)
>> 16/06/06 11:16:47 INFO yarn.Client: Submitting application 6 to
>> ResourceManager
>> 16/06/06 11:16:48 INFO impl.YarnClientImpl: Submitted application
>> application_1465201086091_0006
>> 16/06/06 11:16:49 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:49 INFO yarn.Client:
>> client token: N/A
>> diagnostics: N/A
>> ApplicationMaster host: N/A
>> ApplicationMaster RPC port: -1
>> queue: root.cloudera
>> start time: 1465204607993
>> final status: UNDEFINED
>> tracking URL:
>> http://quickstart.cloudera:8088/proxy/application_1465201086091_0006/
>> user: cloudera
>> 16/06/06 11:16:50 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:51 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:52 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:53 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:54 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:55 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:56 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:57 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:58 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:16:59 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:00 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:01 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:02 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:03 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:04 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:05 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:06 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:07 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:08 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:09 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:10 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:11 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:12 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:13 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:14 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:15 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:16 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:17 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:18 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> 16/06/06 11:17:19 INFO yarn.Client: Application report for
>> application_1465201086091_0006 (state: ACCEPTED)
>> ...
>>
>> If i try to push a product to the kafka topic (amazonRatingsTopic), the
>> kafka broker is living in my host machine (192.168.1.35:9092), i cannot
>> see nothing in the logs. I can see in
>> http://quickstart.cloudera:8888/jobbrowser/ that the job is accepted,
>> when i click on the application_id, i can see this:
>>
>> The application might not be running yet or there is no Node Manager or
>> Container available. This page will be automatically refreshed.
>>
>> even if i push data into the kafka topic. Another think i have noticed is
>> that spark-worker is dead after a few minutes that the job is accepted, i
>> have to restart it manually doing sudo service spark-worker restart.
>>
>> If i run jus command, i see this:
>>
>> [cloudera@quickstart ~]$ jps
>> 11904 SparkSubmit
>> 12890 Jps
>> 7271 sbt-launch.jar
>> [cloudera@quickstart ~]$
>>
>> I know that sbt-launch is the sbt command running in another terminal,
>> but,  ¿Are NameNode processes and DataNode should not appear?
>>
>> Thank you very much for reading until here.
>>
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>>
>> 2016-06-04 18:23 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>
>>> Hi,
>>>
>>> Spark works in local, standalone and yarn-client mode. Start as master =
>>> local. That is the simplest model.You DO not need to start
>>> $SPAK_HOME/sbin/start-master.sh and $SPAK_HOME/sbin/start-slaves.sh
>>>
>>>
>>> Also you do not need to specify all that in spark-submit. In the Scala
>>> code you can do
>>>
>>> val sparkConf = new SparkConf().
>>>              setAppName("CEP_streaming_with_JDBC").
>>>              set("spark.driver.allowMultipleContexts", "true").
>>>              set("spark.hadoop.validateOutputSpecs", "false")
>>>
>>> And specify all that in spark-submit itself with minimum resources
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>                 --packages com.databricks:spark-csv_2.11:1.3.0 \
>>>                 --driver-memory 2G \
>>>                 --num-executors 1 \
>>>                 --executor-memory 2G \
>>>                 --master local \
>>>                 --executor-cores 2 \
>>>                 --conf
>>> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails
>>> -XX:+PrintGCTimeStamps" \
>>>                 --jars
>>> /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>>>                 --class "${FILE_NAME}" \
>>>                 --class ${FILE_NAME} \
>>>                 --conf "spark.ui.port=4040" \
>>>                 ${JAR_FILE}
>>>
>>> The spark GUI UI port is 4040 (the default). Just track the progress of
>>> the job. You can specify your own port by replacing 4040 by a nom used port
>>> value
>>>
>>> Try it anyway.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 3 June 2016 at 11:39, Alonso <alons...@gmail.com> wrote:
>>>
>>>> Hi, i am developing a project that needs to use kafka, spark-streaming
>>>> and spark-mllib, this is the github project
>>>> <https://github.com/alonsoir/awesome-recommendation-engine/tree/develop>
>>>> .
>>>>
>>>> I am using a vmware cdh-5.7-0 image, with 4 cores and 8 GB of ram, the
>>>> file that i want to use is only 16 MB, if i finding problems related with
>>>> resources because the process outputs this message:
>>>>
>>>>
>>>>  .set("spark.driver.allowMultipleContexts", "true")
>>>>
>>>> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>>>> 16/06/03 11:58:09 WARN TaskSchedulerImpl: Initial job has not accepted
>>>> any resources; check your cluster UI to ensure that workers are registered
>>>> and have sufficient resources
>>>>
>>>>
>>>> when i go to spark-master page, i can see this:
>>>>
>>>>
>>>> *Spark Master at spark://192.168.30.137:7077*
>>>>
>>>> *    URL: spark://192.168.30.137:7077*
>>>> *    REST URL: spark://192.168.30.137:6066 (cluster mode)*
>>>> *    Alive Workers: 0*
>>>> *    Cores in use: 0 Total, 0 Used*
>>>> *    Memory in use: 0.0 B Total, 0.0 B Used*
>>>> *    Applications: 2 Running, 0 Completed*
>>>> *    Drivers: 0 Running, 0 Completed*
>>>> *    Status: ALIVE*
>>>>
>>>> *Workers*
>>>> *Worker Id Address State Cores Memory*
>>>> *Running Applications*
>>>> *Application ID Name Cores Memory per Node Submitted Time User State
>>>> Duration*
>>>> *app-20160603115752-0001*
>>>> *(kill)*
>>>> * AmazonKafkaConnector 0 1024.0 MB 2016/06/03 11:57:52 cloudera WAITING
>>>> 2.0 min*
>>>> *app-20160603115751-0000*
>>>> *(kill)*
>>>> * AmazonKafkaConnector 0 1024.0 MB 2016/06/03 11:57:51 cloudera WAITING
>>>> 2.0 min*
>>>>
>>>>
>>>> And this is the spark-worker output:
>>>>
>>>> *Spark Worker at 192.168.30.137:7078*
>>>>
>>>> *    ID: worker-20160603115937-192.168.30.137-7078*
>>>> *    Master URL:*
>>>> *    Cores: 4 (0 Used)*
>>>> *    Memory: 6.7 GB (0.0 B Used)*
>>>>
>>>> *Back to Master*
>>>> *Running Executors (0)*
>>>> *ExecutorID Cores State Memory Job Details Logs*
>>>>
>>>> It is weird isn't ? master url is not set up and there is not any
>>>> ExecutorID, Cores, so on so forth...
>>>>
>>>> If i do a ps xa | grep spark, this is the output:
>>>>
>>>> [cloudera@quickstart bin]$ ps xa | grep spark
>>>>  6330 ?        Sl     0:11 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
>>>> /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
>>>> -Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m
>>>> org.apache.spark.deploy.master.Master
>>>>
>>>>  6674 ?        Sl     0:12 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
>>>> /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
>>>> -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory
>>>> -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m
>>>> org.apache.spark.deploy.history.HistoryServer
>>>>
>>>>  8153 pts/1    Sl+    0:14 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
>>>> /home/cloudera/awesome-recommendation-engine/target/pack/lib/*
>>>> -Dprog.home=/home/cloudera/awesome-recommendation-engine/target/pack
>>>> -Dprog.version=1.0-SNAPSHOT example.spark.AmazonKafkaConnector
>>>> 192.168.1.35:9092 amazonRatingsTopic
>>>>
>>>>  8413 ?        Sl     0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp
>>>> /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*
>>>> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
>>>> spark://quickstart.cloudera:7077
>>>>
>>>>  8619 pts/3    S+     0:00 grep spark
>>>>
>>>> master is set up with four cores and 1 GB and worker has not any
>>>> dedicated core and it is using 1GB, that is weird isn't ? I have configured
>>>> the vmware image with 4 cores (from eight) and 8 GB (from 16).
>>>>
>>>> This is how it looks my build.sbt:
>>>>
>>>> libraryDependencies ++= Seq(
>>>>   "org.apache.kafka" % "kafka_2.10" % "0.8.1"
>>>>       exclude("javax.jms", "jms")
>>>>       exclude("com.sun.jdmk", "jmxtools")
>>>>       exclude("com.sun.jmx", "jmxri"),
>>>>    //not working play module!! check
>>>>    //jdbc,
>>>>    //anorm,
>>>>    //cache,
>>>>    // HTTP client
>>>>    "net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
>>>>    // HTML parser
>>>>    "org.jodd" % "jodd-lagarto" % "3.5.2",
>>>>    "com.typesafe" % "config" % "1.2.1",
>>>>    "com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
>>>>    "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
>>>>    "org.twitter4j" % "twitter4j-core" % "4.0.2",
>>>>    "org.twitter4j" % "twitter4j-stream" % "4.0.2",
>>>>    "org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
>>>>    "org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
>>>>    "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0-cdh5.7.0",
>>>>    "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.0",
>>>>    "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.0",
>>>>    "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.0",
>>>>    "org.apache.spark" % "spark-mllib_2.10" % "1.6.0-cdh5.7.0",
>>>>    "com.google.code.gson" % "gson" % "2.6.2",
>>>>    "commons-cli" % "commons-cli" % "1.3.1",
>>>>    "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
>>>>    // Akka
>>>>    "com.typesafe.akka" %% "akka-actor" % akkaVersion,
>>>>    "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
>>>>    // MongoDB
>>>>    "org.reactivemongo" %% "reactivemongo" % "0.10.0"
>>>> )
>>>>
>>>> packAutoSettings
>>>>
>>>> As you can see, i am using the exact version of spark modules for the
>>>> pseudo cluster and i want to use sbt-pack in order to create
>>>> an unix command, this is how i am declaring programmatically the spark
>>>> context :
>>>>
>>>>
>>>> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>>>>                                    //.setMaster("local[4]")
>>>>
>>>>  .setMaster("spark://192.168.30.137:7077")
>>>>                                    .set("spark.cores.max", "2")
>>>>
>>>> ...
>>>>
>>>> val ratingFile= "hdfs://192.168.30.137:8020/user/cloudera/ratings.csv"
>>>>
>>>>
>>>> println("Using this ratingFile: " + ratingFile)
>>>>   // first create an RDD out of the rating file
>>>>   val rawTrainingRatings = sc.textFile(ratingFile).map {
>>>>     line =>
>>>>       val Array(userId, productId, scoreStr) = line.split(",")
>>>>       AmazonRating(userId, productId, scoreStr.toDouble)
>>>>   }
>>>>
>>>>   // only keep users that have rated between MinRecommendationsPerUser
>>>> and MaxRecommendationsPerUser products
>>>>
>>>>
>>>> //THIS IS THE LINE THAT PROVOKES the
>>>> *WARN TaskSchedulerImp*
>>>>
>>>> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>>>>
>>>> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>>>> *!*
>>>>
>>>>
>>>> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>>>> val trainingRatings = rawTrainingRatings.groupBy(_.userId)
>>>>                                           .filter(r =>
>>>> MinRecommendationsPerUser <= r._2.size  && r._2.size <
>>>> MaxRecommendationsPerUser)
>>>>                                           .flatMap(_._2)
>>>>                                           .repartition(NumPartitions)
>>>>                                           .cache()
>>>>
>>>>   println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings
>>>> out of ${rawTrainingRatings.count()}")
>>>>
>>>> My question is, do you see anything wrong with the code? is there
>>>> anything terrible wrong that i have to change? and,
>>>> what can i do to have this up and running with my resources?
>>>>
>>>> What most annoys me is that the above code works perfectly in the
>>>> console spark of the virtual image but when I try to make it run
>>>> programmatically creating the unix with SBT-pack command does not work.
>>>>
>>>> If the dedicated resources are too few to develop this project, what
>>>> else can i do? i mean, do i need to hire a tiny cluster with AWS
>>>> or any another provider? if that is a correct answer, which are yours
>>>> recommendation?
>>>>
>>>> Thank you very much for reading until here.
>>>>
>>>> Regards,
>>>>
>>>> Alonso
>>>>
>>>>
>>>>
>>>> <https://about.me/alonso.isidoro.roman?promo=email_sig&utm_source=email_sig&utm_medium=email_sig&utm_campaign=external_links>
>>>>
>>>> ------------------------------
>>>> View this message in context: About a problem running a spark job in a
>>>> cdh-5.7.0 vmware image.
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/About-a-problem-running-a-spark-job-in-a-cdh-5-7-0-vmware-image-tp27082.html>
>>>> Sent from the Apache Spark User List mailing list archive
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>>>
>>>
>>>
>>
>

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

Reply via email to