DAGClientRPCServer is for the client service, not for TezChild. You need to
look for the AM log line "Instantiated TaskAttemptListener RPC at".
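
If it helps, here is a rough sketch of pulling every "Instantiated ... at"
address out of an AM log, so the TaskAttemptListener endpoint can be compared
with the host:port that TezChild reports. It assumes the TaskAttemptListener
line has the same "<name> at <host>/<ip>:<port>" shape as the
DAGClientRPCServer line in the log below; the function name rpc_endpoints is
only illustrative and is not part of Tez.

```python
import re

# Matches lines such as
#   "... Instantiated DAGClientRPCServer at host/10.0.0.1:46373"
# Assumption: the "TaskAttemptListener RPC" line follows the same shape.
INSTANTIATED = re.compile(r"Instantiated (?P<name>.+?) at (?P<addr>\S+:\d+)")


def rpc_endpoints(log_text):
    """Return {server name: address} for each 'Instantiated ... at ...' line."""
    endpoints = {}
    for line in log_text.splitlines():
        match = INSTANTIATED.search(line)
        if match:
            endpoints[match.group("name")] = match.group("addr")
    return endpoints


# One (unwrapped) line taken from the AM log quoted in this thread.
sample = (
    "2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] "
    "client.DAGClientServer: Instantiated DAGClientRPCServer at "
    "ip-10-16-141-168.ec2.internal/10.16.141.168:46373\n"
)
print(rpc_endpoints(sample))
```

Run it against the full AM log file rather than the wrapped copy in the mail,
since the wrapping here splits each log record across several lines.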

On Tue, Jul 21, 2015 at 10:21 AM, Rajat Jain <[email protected]> wrote:

> Here are the AM logs:
>
> 2015-07-21 17:08:14,279 INFO [ServiceThread:DAGClientRPCServer] 
> ipc.CallQueueManager: Using callQueue class 
> java.util.concurrent.LinkedBlockingQueue
> 2015-07-21 17:08:14,285 INFO 
> [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] 
> ipc.CallQueueManager: Using callQueue class 
> java.util.concurrent.LinkedBlockingQueue
> 2015-07-21 17:08:14,299 INFO [Socket Reader #1 for port 46373] ipc.Server: 
> Starting Socket Reader #1 for port 46373
> 2015-07-21 17:08:14,300 INFO [Socket Reader #1 for port 37949] ipc.Server: 
> Starting Socket Reader #1 for port 37949
> 2015-07-21 17:08:14,358 INFO [IPC Server Responder] ipc.Server: IPC Server 
> Responder: starting
> 2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC 
> Server listener on 46373: starting
> 2015-07-21 17:08:14,364 INFO [IPC Server Responder] ipc.Server: IPC Server 
> Responder: starting
> 2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC 
> Server listener on 37949: starting
> 2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] 
> client.DAGClientServer: Instantiated DAGClientRPCServer at 
> ip-10-16-141-168.ec2.internal/10.16.141.168:46373
> 2015-07-21 17:08:14,377 INFO [HistoryEventHandlingThread] 
> impl.SimpleHistoryLoggingService: Writing event AM_LAUNCHED to history file
>
>
> The interesting thing to note is the Tez Task is trying to connect to port
> 37949. The DAGClientRPCServer (which uses private DNS) is instantiated on
> 46373. But it also starts another IPC server on 37949 though I'm not sure
> what it is for.
>
> On Tue, Jul 21, 2015 at 10:13 AM, Rajat Jain <[email protected]> wrote:
>
>> Hi,
>>
>> I am running a yarn cluster on AWS. The slave nodes (NMs) are all
>> configured to listen on private DNS. For example, a sample node manager
>> listens on ip-10-16-141-168.ec2.internal:8042
>> <https://multicluster.qubole.net/cluster-proxy?encodedUrl=http%3A%2F%2Fip-10-16-141-168.ec2.internal%3A8042%2F>
>> .
>>
>> When I'm trying to run a Tez job (even simple ones like select count(*)
>> from nation) - they fail because child tasks are unable to connect to the
>> AM. The issue is they are trying to connect to the IP instead of the
>> private DNS. Here's a sample log line (couple of them added by me for
>> debugging):
>>
>> 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
>> 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory 
>> class: org.apache.hadoop.net.StandardSocketFactory
>> 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, containerIdentifier: 
>>  3699, container_1437498369268_0001_01_000002
>> 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: 
>> fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: 
>> 10.16.141.168:37949
>> 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: 
>> 10.16.141.168:37949
>> 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting 
>> to fetch new task
>> 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to 
>> server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry 
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
>> MILLISECONDS)
>> 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to 
>> server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry 
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
>> MILLISECONDS)
>> 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to 
>> server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry 
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
>> MILLISECONDS)
>> 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to 
>> server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry 
>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 
>> MILLISECONDS)
>>
>>
>> The task ultimately fails. Any idea how this can be fixed? These jobs ran
>> fine on Tez 0.4.1.
>>
>> Thanks,
>> Rajat
>>
>
>


-- 
Best Regards

Jeff Zhang
