Hi David, the one on the develop branch. I think it should be the same, but actually I'm not sure...
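In case it helps while you check the repository, this is roughly the shape a build.sbt for this project should have. The version numbers below are assumptions on my part (CDH 5.7 ships Spark 1.6), so treat the develop branch of the repository as the authoritative source:

```scala
// Hypothetical sketch of a build.sbt for this project -- the versions
// are assumptions; the develop branch of the repository is authoritative.
name := "awesome-recommendation-engine"

version := "1.0-SNAPSHOT"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.0",
  "org.apache.spark" %% "spark-sql"             % "1.6.0",
  "org.apache.spark" %% "spark-streaming"       % "1.6.0",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0"
)
```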
Regards,

Alonso Isidoro Roman
about.me/alonso.isidoro.roman

2016-05-31 19:40 GMT+02:00 David Newberger <david.newber...@wandcorp.com>:

> Is
> https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt
> the build.sbt you are using?
>
> *David Newberger*
> QA Analyst
> *WAND* - *The Future of Restaurant Technology*
> (W) www.wandcorp.com
> (E) david.newber...@wandcorp.com
> (P) 952.361.6200
>
> *From:* Alonso [mailto:alons...@gmail.com]
> *Sent:* Tuesday, May 31, 2016 11:11 AM
> *To:* user@spark.apache.org
> *Subject:* About a problem when mapping a file located within a HDFS
> vmware cdh-5.7 image
>
> I have a VMware Cloudera image, cdh-5.7, running CentOS 6.8. I am using
> OS X as my development machine and the CDH image to run the code, which I
> upload to the image using git. I have modified the /etc/hosts file of the
> CDH image so that it contains these lines:
>
> 127.0.0.1       quickstart.cloudera quickstart localhost localhost.domain
> 192.168.30.138  quickstart.cloudera quickstart localhost localhost.domain
>
> The Cloudera version that I am running is:
>
> [cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties
>
> # Autogenerated build properties
> version=2.6.0-cdh5.7.0
> git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
> cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
> cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
> cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
> cloudera.base-branch=cdh5-base-2.6.0
> cloudera.build-branch=cdh5-2.6.0_5.7.0
> cloudera.pkg.version=2.6.0+cdh5.7.0+1280
> cloudera.pkg.release=1.cdh5.7.0.p0.92
> cloudera.cdh.release=cdh5.7.0
> cloudera.build.time=2016.03.23-18:30:29GMT
>
> I can do an ls command in the VMware machine:
>
> [cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
> -rw-r--r--   1 cloudera cloudera   16906296 2016-05-30 11:29 /user/cloudera/ratings.csv
>
> I can read its content:
>
> [cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
> 568454
>
> The code is quite simple, just trying to map its content:
>
> val ratingFile = "hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"
>
> case class AmazonRating(userId: String, productId: String, rating: Double)
>
> val NumRecommendations = 10
> val MinRecommendationsPerUser = 10
> val MaxRecommendationsPerUser = 20
> val MyUsername = "myself"
> val NumPartitions = 20
>
> println("Using this ratingFile: " + ratingFile)
>
> // first create an RDD out of the rating file
> val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
>   val Array(userId, productId, scoreStr) = line.split(",")
>   AmazonRating(userId, productId, scoreStr.toDouble)
> }
>
> // only keep users that have rated between MinRecommendationsPerUser and
> // MaxRecommendationsPerUser products
> val trainingRatings = rawTrainingRatings
>   .groupBy(_.userId)
>   .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
>   .flatMap(_._2)
>   .repartition(NumPartitions)
>   .cache()
>
> println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")
>
> When I run it programmatically, I get this message:
>
> Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0
> ratings out of 568454
>
> but if I run the exact same code within the spark-shell, I get this message:
>
> Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv.
> Kept 73279 ratings out of 568454
>
> *Why does it work fine within the spark-shell but not when run
> programmatically in the VMware image?*
>
> I am running the code using the sbt-pack plugin to generate Unix commands,
> and I run them within the VMware image, which hosts the Spark pseudo-cluster.
>
> This is the code I use to instantiate the SparkConf:
>
> val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
>   .setMaster("local[4]").set("spark.driver.allowMultipleContexts", "true")
>
> val sc = new SparkContext(sparkConf)
> val sqlContext = new SQLContext(sc)
> val ssc = new StreamingContext(sparkConf, Seconds(2))
>
> // this checkpoint dir should be in a conf file, for now it is hardcoded!
> val streamingCheckpointDir = "/home/cloudera/my-recommendation-spark-engine/checkpoint"
> ssc.checkpoint(streamingCheckpointDir)
>
> I have tried this way of setting the Spark master, but an exception is
> raised; I suspect that this is symptomatic of my problem:
>
> //.setMaster("spark://quickstart.cloudera:7077")
>
> The exception when I try to use the fully qualified domain name:
>
> .setMaster("spark://quickstart.cloudera:7077")
>
> java.io.IOException: Failed to connect to quickstart.cloudera/127.0.0.1:7077
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
>         at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: quickstart.cloudera/127.0.0.1:7077
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>
> I can ping quickstart.cloudera in the Cloudera terminal, so why can't I
> use .setMaster("spark://quickstart.cloudera:7077") instead of
> .setMaster("local[*]")?
>
> [cloudera@quickstart bin]$ ping quickstart.cloudera
> PING quickstart.cloudera (127.0.0.1) 56(84) bytes of data.
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=1 ttl=64 time=0.019 ms
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=2 ttl=64 time=0.026 ms
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=4 ttl=64 time=0.028 ms
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=5 ttl=64 time=0.026 ms
> 64 bytes from quickstart.cloudera (127.0.0.1): icmp_seq=6 ttl=64 time=0.020 ms
>
> And port 7077 is listening for incoming connections:
>
> [cloudera@quickstart bin]$ netstat -nap | grep 7077
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> tcp        0      0 192.168.30.138:7077         0.0.0.0:*                   LISTEN
>
> [cloudera@quickstart bin]$ ping 192.168.30.138
> PING 192.168.30.138 (192.168.30.138) 56(84) bytes of data.
> 64 bytes from 192.168.30.138: icmp_seq=1 ttl=64 time=0.023 ms
> 64 bytes from 192.168.30.138: icmp_seq=2 ttl=64 time=0.026 ms
> 64 bytes from 192.168.30.138: icmp_seq=3 ttl=64 time=0.028 ms
> ^C
> --- 192.168.30.138 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 2810ms
> rtt min/avg/max/mdev = 0.023/0.025/0.028/0.006 ms
>
> [cloudera@quickstart bin]$ ifconfig
> eth2      Link encap:Ethernet  HWaddr 00:0C:29:6F:80:D2
>           inet addr:192.168.30.138  Bcast:192.168.30.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:8612 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:8493 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:2917515 (2.7 MiB)  TX bytes:849750 (829.8 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:57534 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:57534 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:44440656 (42.3 MiB)  TX bytes:44440656 (42.3 MiB)
>
> I think that this must be a misconfiguration in a Cloudera configuration
> file, but which one?
>
> Thank you very much for reading this far.
>
> *Alonso Isidoro Roman*
> about.me/alonso.isidoro.roman
>
> ------------------------------
> View this message in context: About a problem when mapping a file located
> within a HDFS vmware cdh-5.7 image
> <http://apache-spark-user-list.1001560.n3.nabble.com/About-a-problem-when-mapping-a-file-located-within-a-HDFS-vmware-cdh-5-7-image-tp27058.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
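One more thought on the quoted code above, whatever the root cause of the "Kept 0 ratings" symptom turns out to be: the pattern `val Array(userId, productId, scoreStr) = line.split(",")` throws a MatchError on any line that does not have exactly three fields (a CSV header, extra commas), and `scoreStr.toDouble` throws on non-numeric scores. A more defensive variant of the parsing step might look like this sketch; `parseRating` is a hypothetical helper, not part of the original code:

```scala
import scala.util.Try

case class AmazonRating(userId: String, productId: String, rating: Double)

// Hypothetical helper: parse one CSV line defensively. Lines without
// exactly three fields, or with a non-numeric score, yield None instead
// of throwing a MatchError / NumberFormatException on the executors.
def parseRating(line: String): Option[AmazonRating] =
  line.split(",") match {
    case Array(userId, productId, scoreStr) =>
      Try(scoreStr.toDouble).toOption.map(AmazonRating(userId, productId, _))
    case _ => None
  }

// In the job, a flatMap over the Option skips bad lines instead of failing:
// val rawTrainingRatings = sc.textFile(ratingFile).flatMap(parseRating)
```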