Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-06-01 Thread Alonso Isidoro Roman
Thank you David, I will try to follow your advice.

Alonso Isidoro Roman
about.me/alonso.isidoro.roman


RE: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-05-31 Thread David Newberger
Have you tried it without either of the setMaster lines?

Also, CDH 5.7 uses Spark 1.6.0 with some patches. I would recommend using the
Cloudera repo for the Spark dependencies in build.sbt. I'd also check the other
entries in build.sbt to see if there are CDH-specific versions.
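For reference, a sketch of what that could look like in build.sbt; the resolver URL is Cloudera's public artifact repo, but the exact version strings are assumptions that should be checked against the repo before use:

```scala
// build.sbt fragment (sketch; verify the version strings against the Cloudera repo)
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  // CDH 5.7 ships Spark 1.6.0 with Cloudera patches, hence the -cdh5.7.0 suffix
  "org.apache.spark" %% "spark-core"      % "1.6.0-cdh5.7.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.6.0-cdh5.7.0" % "provided"
)
```

Marking the Spark artifacts "provided" keeps them out of the packaged launch scripts, so the version installed on the cluster is the one actually used at runtime.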

David Newberger


Re: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-05-31 Thread Alonso Isidoro Roman
Hi David, the one on the develop branch. I think it should be the same, but I'm
actually not sure...

Regards

Alonso Isidoro Roman
about.me/alonso.isidoro.roman



RE: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-05-31 Thread David Newberger
Is
https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt
the build.sbt you are using?

David Newberger
QA Analyst
WAND  -  The Future of Restaurant Technology
(W)  www.wandcorp.com
(E)   david.newber...@wandcorp.com
(P)   952.361.6200

From: Alonso [mailto:alons...@gmail.com]
Sent: Tuesday, May 31, 2016 11:11 AM
To: user@spark.apache.org
Subject: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image


I have a VMware Cloudera image (CDH 5.7) running CentOS 6.8. I use OS X as my
development machine and the CDH image to run the code, which I upload to the
image with git. I have modified the /etc/hosts file in the CDH image with lines
like these:

127.0.0.1        quickstart.cloudera quickstart localhost localhost.domain
192.168.30.138   quickstart.cloudera quickstart localhost localhost.domain

The Cloudera version that I am running is:

[cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties
# Autogenerated build properties
version=2.6.0-cdh5.7.0
git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
cloudera.base-branch=cdh5-base-2.6.0
cloudera.build-branch=cdh5-2.6.0_5.7.0
cloudera.pkg.version=2.6.0+cdh5.7.0+1280
cloudera.pkg.release=1.cdh5.7.0.p0.92
cloudera.cdh.release=cdh5.7.0
cloudera.build.time=2016.03.23-18:30:29GMT

I can run an ls command in the VMware machine:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
-rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv

I can read its content:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l

568454

The code is quite simple; it just maps the file's content:

val ratingFile = "hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

case class AmazonRating(userId: String, productId: String, rating: Double)

val NumRecommendations = 10
val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val MyUsername = "myself"
val NumPartitions = 20

println("Using this ratingFile: " + ratingFile)

// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}

// only keep users that have rated between MinRecommendationsPerUser and
// MaxRecommendationsPerUser products
val trainingRatings = rawTrainingRatings
  .groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()

println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")

I am getting this message:

Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 ratings out of 568454

whereas if I run the exact same code within the spark-shell, I get this message:

Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454

Why does it work fine within the spark-shell but not when run programmatically
in the VMware image?
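As an aside, the `val Array(userId, productId, scoreStr) = line.split(",")` pattern throws a MatchError on any line that does not have exactly three fields, and toDouble throws on a non-numeric score. A minimal plain-Scala sketch (no Spark; the sample lines are made up) of a defensive parse that drops malformed lines instead of failing the whole job:

```scala
import scala.util.Try

object ParseSketch {
  case class AmazonRating(userId: String, productId: String, rating: Double)

  // Returns None for lines with the wrong field count or a non-numeric score
  def parse(line: String): Option[AmazonRating] =
    line.split(",") match {
      case Array(u, p, s) => Try(s.toDouble).toOption.map(AmazonRating(u, p, _))
      case _              => None
    }

  def main(args: Array[String]): Unit = {
    val lines = List("u1,p1,4.0", "u2,p2,bad-score", "u3,p3,5.0,extra-field")
    val kept  = lines.flatMap(parse)
    println(kept.size) // only "u1,p1,4.0" survives
  }
}
```

On an RDD the same idea would be `sc.textFile(ratingFile).flatMap(parse)`, which silently skips bad lines rather than killing the tasks.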

I run the code using the sbt-pack plugin to generate Unix launch scripts and run
them within the VMware image, which hosts the Spark pseudo-cluster.

This is the code I use to instantiate the SparkConf:

val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
  .setMaster("local[4]")
  .set("spark.driver.allowMultipleContexts", "true")

val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sparkConf, Seconds(2))

// this checkpoint dir should be in a conf file; for now it is hardcoded!
val streamingCheckpointDir = "/home/cloudera/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)
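A common alternative, sketched below rather than taken from the project, is to leave the master out of the code entirely and supply it at launch time, which avoids the hard-coded local[4]:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch (not the project's actual code): no setMaster here; supply it at
// launch time instead, e.g.:
//   spark-submit --master spark://quickstart.cloudera:7077 ...
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val sc  = ssc.sparkContext  // reuse the SparkContext the StreamingContext created
val sqlContext = new SQLContext(sc)
```

Reusing ssc.sparkContext also means only one context exists in the JVM, so the spark.driver.allowMultipleContexts workaround should no longer be needed.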

I have tried to set the Spark master this way, but an exception is raised; I
suspect this is symptomatic of my problem:

//.setMaster("spark://quickstart.cloudera:7077")

The exception when I try to use the fully qualified domain name:

.setMaster("spark://quickstart.cloudera:7077")

java.io.IOException: Failed to connect to quickstart.cloudera/127.0.0.1:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
        at
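The "quickstart.cloudera/127.0.0.1:7077" in the message shows the hostname resolving to loopback, which matches the first of the two /etc/hosts lines shown earlier. A tiny JVM-only sketch (hostname lookup only, no Spark) that can be run on the guest to confirm what a name resolves to; the quickstart.cloudera lookup in the comment is illustrative and only works on the CDH guest:

```scala
import java.net.InetAddress

object ResolveCheck {
  // Returns the IP address the JVM resolves this host name to
  def resolve(host: String): String =
    InetAddress.getByName(host).getHostAddress

  def main(args: Array[String]): Unit = {
    // On the CDH guest, check the cluster hostname:
    //   println(resolve("quickstart.cloudera"))  // want 192.168.30.138, not 127.0.0.1
    println(resolve("127.0.0.1"))  // loopback resolves to itself
  }
}
```

If the FQDN resolves to 127.0.0.1, reordering or removing the loopback /etc/hosts entry so the LAN address wins is the usual fix for this class of "Failed to connect" error.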