+1 for the misleading error. Messages about failing to connect often mean that 
an executor has died. If so, dig into the executor logs and find out why the 
executor died (out of memory, perhaps). 
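If the executor logs do show an OutOfMemoryError, raising the memory settings is the usual first step. A minimal sketch, assuming Spark's standard configuration keys (`spark.executor.memory` is real; the values and class name here are illustrative, not recommendations):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MoreMemory {
    public static void main(String[] args) {
        // Raise executor memory. Note: in local mode everything runs inside
        // the driver JVM, so driver memory matters too, and that one must be
        // set before the JVM starts (e.g. spark-submit --driver-memory 2g)
        // rather than in code.
        SparkConf conf = new SparkConf()
                .setAppName("DR")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... job code ...
        jsc.stop();
    }
}
```

Check the executor logs first, though; more memory only helps if OOM is actually what killed the executor.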

Andrew

> On Jul 23, 2016, at 11:39 AM, VG <vlin...@gmail.com> wrote:
> 
> Hi Pedro,
> 
> Based on your suggestion, I deployed this on an AWS node and it worked fine. 
> Thanks for your advice. 
> 
> I am still trying to figure out the issues in the local environment.
> Anyway, thanks again.
> 
> -VG
> 
> On Sat, Jul 23, 2016 at 9:26 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
> Have you changed spark-env.sh or spark-defaults.conf from the defaults? It 
> looks like Spark is trying to address local workers using a network 
> address (e.g. 192.168…) instead of localhost (localhost, 127.0.0.1, 
> 0.0.0.0, …), and that network address doesn’t resolve correctly. You 
> might also check /etc/hosts to make sure you don’t have anything odd 
> going on.
> 
> One last thing to check: are you running Spark within a VM and/or 
> Docker? If networking isn’t set up correctly in those, you may also run into 
> trouble.
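The loopback binding described above can usually be forced explicitly. A sketch, assuming a default Spark 1.x/2.x installation layout; both keys are standard Spark settings, but the loopback value is an assumption about this particular setup:

```
# conf/spark-env.sh — bind Spark's services to the loopback interface
SPARK_LOCAL_IP=127.0.0.1

# conf/spark-defaults.conf — advertise localhost to executors instead of 192.168.x.x
spark.driver.host    127.0.0.1
```

After changing either file, restart the master and workers so the new binding takes effect.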
> 
> What would be helpful is to know everything about your setup that might 
> affect networking.
> 
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
> 
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | linkedin.com/in/pedrorodriguezscience
> On July 23, 2016 at 9:10:31 AM, VG (vlin...@gmail.com) wrote:
> 
>> Hi Pedro,
>> 
>> Apologies for not adding this earlier. 
>> 
>> This is running on a local cluster set up as follows.
>> JavaSparkContext jsc = new JavaSparkContext("local[2]", "DR");
>> 
>> Any suggestions based on this? 
>> 
>> The ports are not blocked by firewall. 
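As an aside, the two-argument JavaSparkContext constructor shown above is the older API; the SparkConf-based form is equivalent and also gives a place to pin the driver to localhost. A sketch, keeping the same master and app name — the `spark.driver.host` value is an assumption about the local setup, not something the thread confirms:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalContext {
    public static void main(String[] args) {
        // Equivalent of: new JavaSparkContext("local[2]", "DR")
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("DR")
                // Assumption: force the driver to advertise the loopback
                // address rather than a possibly-unroutable LAN address.
                .set("spark.driver.host", "127.0.0.1");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... job code ...
        jsc.stop();
    }
}
```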
>> 
>> Regards,
>> 
>> 
>> 
>> On Sat, Jul 23, 2016 at 8:35 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>> Make sure that you don’t have ports firewalled. You don’t really give much 
>> information to work from, but it looks like the master can’t access the 
>> worker nodes for some reason. If you give more information on the cluster, 
>> networking, etc., it would help.
>> 
>> For example, on AWS you can create a security group which allows all traffic 
>> to/from itself to itself. If you are using something like ufw on ubuntu then 
>> you probably need to know the ip addresses of the worker nodes beforehand.
>> 
>> —
>> Pedro Rodriguez
>> PhD Student in Large-Scale Machine Learning | CU Boulder
>> Systems Oriented Data Scientist
>> UC Berkeley AMPLab Alumni
>> 
>> pedrorodriguez.io | 909-353-4423
>> github.com/EntilZha | linkedin.com/in/pedrorodriguezscience
>> On July 23, 2016 at 7:38:01 AM, VG (vlin...@gmail.com) wrote:
>> 
>>> Please suggest if I am doing something wrong or an alternative way of doing 
>>> this. 
>>> 
>>> I have an RDD with two values, as follows:
>>> JavaPairRDD<String, Long> rdd
>>> 
>>> When I execute rdd.collectAsMap(), it always fails with IO exceptions.
>>> 
>>> 
>>> 16/07/23 19:03:58 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
>>> java.io.IOException: Failed to connect to /192.168.1.3:58179
>>>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>>>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>>>     at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>>>     at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
>>>     at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
>>>     at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
>>>     at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
>>>     at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>>>     at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>>>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>>>     at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>     at java.lang.Thread.run(Unknown Source)
>>> Caused by: java.net.ConnectException: Connection timed out: no further information: /192.168.1.3:58179
>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>>>     at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>>>     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>>>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>>>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>>> ... 1 more
>>> 16/07/23 19:03:58 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 
>>> outstanding blocks after 5000 ms
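For reference, a self-contained sketch of the failing call, with hypothetical sample data; collectAsMap() copies every pair back to the driver, and that result fetch is exactly what the RetryingBlockFetcher log above shows timing out. The `spark.driver.host` setting is an assumption, not something from the thread:

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class CollectAsMapRepro {
    public static void main(String[] args) {
        // Assumption: advertise 127.0.0.1 so the result fetch does not go
        // through an unroutable 192.168.x.x address as in the log above.
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("DR")
                .set("spark.driver.host", "127.0.0.1");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Hypothetical stand-in for the real JavaPairRDD<String, Long>.
        JavaPairRDD<String, Long> rdd = jsc.parallelizePairs(
                Arrays.asList(new Tuple2<>("a", 1L), new Tuple2<>("b", 2L)));

        // Pulls every (key, value) pair back to the driver as a java.util.Map.
        Map<String, Long> m = rdd.collectAsMap();
        System.out.println(m.size());

        jsc.stop();
    }
}
```

If this sketch fails the same way, the problem is the driver's advertised address rather than the job logic.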
>>> 
>>> 
>>> 
>> 
> 
