Filed: https://issues.apache.org/jira/browse/TEZ-2630

On Tue, Jul 21, 2015 at 11:12 AM, Rajat Jain <[email protected]> wrote:

> Tried that. The AM is listening at the right address, but TezChild is
> receiving the IP address instead of the private DNS name.
>
> AM logs:
>
> 2015-07-21 18:09:27,906 INFO 
> [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] 
> app.TaskAttemptListenerImpTezDag: Listening at address: 
> ip-10-234-2-80.ec2.internal:49967
>
>
> TezChild logs:
>
> 2015-07-21 18:09:35,353 INFO [main] task.TezChild: TezChild starting
> 2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: 
> 10.234.2.80,49967,container_1437501941642_0001_01_000002,application_1437501941642_0001,1
> 2015-07-21 18:09:35,770 INFO [main] task.TezChild: Using socket factory 
> class: org.apache.hadoop.net.StandardSocketFactory
> 2015-07-21 18:09:35,785 INFO [main] task.TezChild: PID, containerIdentifier:  
> 8670, container_1437501941642_0001_01_000002
> 2015-07-21 18:09:35,864 INFO [main] Configuration.deprecation: 
> fs.default.name is deprecated. Instead, use fs.defaultFS
> 2015-07-21 18:09:36,403 INFO [main] task.TezChild: Got host:port: 
> 10.234.2.80:49967
> 2015-07-21 18:09:36,413 INFO [main] task.TezChild: address variables: 
> 10.234.2.80:49967
>
>
> Any idea what changed between 0.4.1 and 0.7.0? Things worked fine out of
> the box in 0.4.1.
>
> On Tue, Jul 21, 2015 at 10:28 AM, Jeff Zhang <[email protected]> wrote:
>
>> DAGClientRPCServer is for the client service, not for TezChild. You need to
>> look at "Instantiated TaskAttemptListener RPC at"
>>
>> On Tue, Jul 21, 2015 at 10:21 AM, Rajat Jain <[email protected]> wrote:
>>
>>> Here are the AM logs:
>>>
>>> 2015-07-21 17:08:14,279 INFO [ServiceThread:DAGClientRPCServer] 
>>> ipc.CallQueueManager: Using callQueue class 
>>> java.util.concurrent.LinkedBlockingQueue
>>> 2015-07-21 17:08:14,285 INFO 
>>> [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] 
>>> ipc.CallQueueManager: Using callQueue class 
>>> java.util.concurrent.LinkedBlockingQueue
>>> 2015-07-21 17:08:14,299 INFO [Socket Reader #1 for port 46373] ipc.Server: 
>>> Starting Socket Reader #1 for port 46373
>>> 2015-07-21 17:08:14,300 INFO [Socket Reader #1 for port 37949] ipc.Server: 
>>> Starting Socket Reader #1 for port 37949
>>> 2015-07-21 17:08:14,358 INFO [IPC Server Responder] ipc.Server: IPC Server 
>>> Responder: starting
>>> 2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC 
>>> Server listener on 46373: starting
>>> 2015-07-21 17:08:14,364 INFO [IPC Server Responder] ipc.Server: IPC Server 
>>> Responder: starting
>>> 2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC 
>>> Server listener on 37949: starting
>>> 2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] 
>>> client.DAGClientServer: Instantiated DAGClientRPCServer at 
>>> ip-10-16-141-168.ec2.internal/10.16.141.168:46373
>>> 2015-07-21 17:08:14,377 INFO [HistoryEventHandlingThread] 
>>> impl.SimpleHistoryLoggingService: Writing event AM_LAUNCHED to history file
>>>
>>>
>>> Interestingly, the Tez task is trying to connect to port 37949. The
>>> DAGClientRPCServer (which uses the private DNS name) is instantiated on
>>> 46373, but the AM also starts another IPC server on 37949, though I'm not
>>> sure what it is for.
>>>
>>> On Tue, Jul 21, 2015 at 10:13 AM, Rajat Jain <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am running a YARN cluster on AWS. The slave nodes (NMs) are all
>>>> configured to listen on private DNS. For example, a sample node manager
>>>> listens on ip-10-16-141-168.ec2.internal:8042.
>>>>
>>>> When I try to run a Tez job (even a simple one like select count(*)
>>>> from nation), it fails because the child tasks are unable to connect to
>>>> the AM. They are trying to connect to the IP instead of the private DNS
>>>> name. Here are some sample log lines (a couple of them added by me for
>>>> debugging):
>>>>
>>>> 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
>>>> 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory 
>>>> class: org.apache.hadoop.net.StandardSocketFactory
>>>> 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, 
>>>> containerIdentifier:  3699, container_1437498369268_0001_01_000002
>>>> 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: 
>>>> fs.default.name is deprecated. Instead, use fs.defaultFS
>>>> 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: 
>>>> 10.16.141.168:37949
>>>> 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: 
>>>> 10.16.141.168:37949
>>>> 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting 
>>>> to fetch new task
>>>> 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to 
>>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry 
>>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, 
>>>> sleepTime=1000 MILLISECONDS)
>>>> 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to 
>>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry 
>>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, 
>>>> sleepTime=1000 MILLISECONDS)
>>>> 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to 
>>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry 
>>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, 
>>>> sleepTime=1000 MILLISECONDS)
>>>> 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to 
>>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry 
>>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, 
>>>> sleepTime=1000 MILLISECONDS)
>>>>
>>>>
>>>> The task ultimately fails. Any idea how this can be fixed? These jobs
>>>> ran fine on Tez 0.4.1.
>>>>
>>>> Thanks,
>>>> Rajat
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
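
For anyone comparing logs the same way as in the thread above, here is a minimal sketch (not from the thread, and not Tez code) that checks whether the host handed to TezChild is an IP literal while the AM reports a DNS name. It assumes the first two comma-separated fields of the TezChild "Args:" value are host and port, as the quoted log lines suggest. The function names are hypothetical, and Python is used only for illustration.

```python
import ipaddress
from typing import Tuple


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4/IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def parse_listen_address(address: str) -> Tuple[str, int]:
    """Split a 'host:port' string such as the AM's 'Listening at address' value."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def parse_tezchild_args(args: str) -> Tuple[str, int]:
    """Split the TezChild 'Args:' value.

    Assumption: fields 0 and 1 are host and port, inferred from the logs
    quoted in this thread, not from the Tez source.
    """
    fields = args.split(",")
    return fields[0], int(fields[1])


def address_form_mismatch(am_address: str, child_args: str) -> bool:
    """True when the ports agree but the AM logs a DNS name and the child got an IP."""
    am_host, am_port = parse_listen_address(am_address)
    child_host, child_port = parse_tezchild_args(child_args)
    return (
        am_port == child_port
        and not is_ip_literal(am_host)
        and is_ip_literal(child_host)
    )


if __name__ == "__main__":
    # Values copied from the AM and TezChild logs quoted above.
    am = "ip-10-234-2-80.ec2.internal:49967"
    child = ("10.234.2.80,49967,container_1437501941642_0001_01_000002,"
             "application_1437501941642_0001,1")
    print(address_form_mismatch(am, child))
```

The check is purely string based and performs no DNS lookups, so it only flags that the two sides use different address forms. It does not explain why the AM passes the IP to the container.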
