[jira] [Commented] (MAPREDUCE-7298) Distcp doesn't close the job after the job is completed

2020-10-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208225#comment-17208225
 ] 

Steve Loughran commented on MAPREDUCE-7298:
---

* Should this go into Hadoop 3.3.x?
* Arpit: you know that you now have to maintain distcp until someone else 
writes a patch for it :)

> Distcp doesn't close the job after the job is completed
> ---
>
> Key: MAPREDUCE-7298
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7298
> Project: Hadoop Map/Reduce
>  Issue Type: Task
>  Components: distcp
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: MAPREDUCE-7298.01.patch, MAPREDUCE-7298.02.patch
>
>
> Distcp doesn't close the job after the job is completed. This leads to leaked 
> Truststore Reloader threads.
> The fix is to close the job once it is complete. job.close internally calls 
> yarnClient.close(), which then calls timelineConnector.serviceStop(). This 
> destroys the sslFactory, cleaning up the ReloadingX509TrustManager.
> Without the patch, each distcp job creates a new ReloadingX509TrustManager, 
> which in turn creates a new thread. These threads are never killed and they 
> remain alive until HS2 is restarted. With the close, the thread is cleaned up 
> once the job is completed.
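
The leak described above can be modelled without Hadoop on the classpath. What
follows is a minimal sketch, not the actual patch. ReloadingClient is a
hypothetical stand-in for the client objects a job creates (it owns one
background thread, as the description says each ReloadingX509TrustManager
does), and runJobs is an illustrative helper. Neither name comes from Hadoop.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class JobCloseSketch {

    /** Hypothetical stand-in for a client that owns a background reloader thread. */
    static final class ReloadingClient implements AutoCloseable {
        static final AtomicInteger LIVE_THREADS = new AtomicInteger();
        private final Thread reloader;

        ReloadingClient() {
            reloader = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(10); // stands in for "re-check the truststore file"
                    }
                } catch (InterruptedException e) {
                    // interrupted by close(): fall through and let the thread exit
                } finally {
                    LIVE_THREADS.decrementAndGet();
                }
            }, "truststore-reloader-sketch");
            reloader.setDaemon(true); // so this demo JVM can still exit
            LIVE_THREADS.incrementAndGet();
            reloader.start();
        }

        @Override
        public void close() {
            reloader.interrupt();
            try {
                reloader.join(); // wait until the reloader thread has really gone
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs n "jobs" and returns how many reloader threads they left behind. */
    static int runJobs(int n, boolean closeAfterCompletion) {
        int before = ReloadingClient.LIVE_THREADS.get();
        for (int i = 0; i < n; i++) {
            ReloadingClient client = new ReloadingClient();
            // ... the job would run to completion here ...
            if (closeAfterCompletion) {
                client.close();
            }
        }
        return ReloadingClient.LIVE_THREADS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("leaked without close: " + runJobs(3, false));
        System.out.println("leaked with close: " + runJobs(3, true));
    }
}
```

In this model every unclosed "job" leaves one thread behind, while closing
after completion leaves none. In DistCp the corresponding step is, per the
description, the call to job.close once the job has completed.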



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default

2020-10-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208145#comment-17208145
 ] 

Steve Loughran commented on MAPREDUCE-7282:
---

bq.  Tasks request permission from the AM to commit.

Yes, and then we assume that they continue to completion rather than pausing 
for an extended period of time. So by the time the AM/Spark driver gets a 
timeout, the cause can be assumed to be either a network failure or a failed 
worker (the process died, or its VM/k8s container was terminated). The 
"suspended for a long time and then continues" risk does exist. It is unlikely 
on a physical cluster but, in a world of VMs, not entirely inconceivable. 

I note that the MR AM does track the time since its last heartbeat to the YARN 
RM in order to detect partitions; workers don't. 
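
For illustration only, this is roughly what "track the time since the last
heartbeat" could look like on the worker side. It is a hypothetical sketch:
HeartbeatTrackerSketch, its method names and the timeout parameter are all
assumptions made for the example, and none of it is taken from the MR AM code.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: remembers when the last heartbeat succeeded. */
final class HeartbeatTrackerSketch {
    private final LongSupplier clockMillis; // injected so the logic is testable
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatTrackerSketch(LongSupplier clockMillis, long timeoutMillis) {
        this.clockMillis = clockMillis;
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = clockMillis.getAsLong();
    }

    /** Call after every heartbeat that was acknowledged. */
    void heartbeatSucceeded() {
        lastHeartbeatMillis = clockMillis.getAsLong();
    }

    /** True if no heartbeat has succeeded for longer than the timeout. */
    boolean maybePartitioned() {
        return clockMillis.getAsLong() - lastHeartbeatMillis > timeoutMillis;
    }
}
```

A worker using something like this could check maybePartitioned() before 
continuing a task commit. A real implementation would want a monotonic clock 
(for example one derived from System.nanoTime()) rather than wall-clock time.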

> MR v2 commit algorithm should be deprecated and not the default
> ---
>
> Key: MAPREDUCE-7282
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7282
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 3.3.0, 3.2.1, 3.1.3, 3.3.1
>Reporter: Steve Loughran
>Priority: Major
>
> The v2 MR commit algorithm moves files from the task attempt dir into the 
> dest dir on task commit, one by one.
> It is therefore not atomic:
> # if a task commit fails partway through and another task attempt commits, 
> then unless exactly the same filenames are used, output of the first attempt 
> may be included in the final result
> # if a worker is partitioned partway through task commit, and then continues 
> after another attempt has committed, it may partially overwrite the output, 
> even when the filenames are the same
> Both MR and Spark assume that task commits are atomic. Either they need to 
> consider that this is not the case: we add a way to probe for a committer 
> supporting atomic task commit, and both engines add handling for task 
> commit failures (probably fail the job).
> Better: we remove this as the default, and maybe also warn when it is being 
> used.
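
The first failure mode above can be reproduced on a local filesystem. This is
a minimal sketch of a file-by-file commit, not the FileOutputCommitter code.
CommitSketch, commitFileByFile, the failAfter parameter and the file names are
all hypothetical and exist only to simulate a task attempt dying mid-commit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CommitSketch {

    /**
     * v2-style task commit: move each file from the attempt dir into dest, one by one.
     * failAfter simulates the attempt dying once that many files have been moved
     * (a negative value means no failure).
     */
    static void commitFileByFile(Path attemptDir, Path dest, int failAfter) {
        try (Stream<Path> files = Files.list(attemptDir)) {
            int moved = 0;
            for (Path f : files.sorted().collect(Collectors.toList())) {
                if (moved == failAfter) {
                    throw new IllegalStateException("simulated failure mid-commit");
                }
                Files.move(f, dest.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                moved++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Path writeAttempt(Path root, String name, String... files) throws IOException {
        Path dir = Files.createDirectory(root.resolve(name));
        for (String f : files) {
            Files.write(dir.resolve(f), f.getBytes(StandardCharsets.UTF_8));
        }
        return dir;
    }

    /** Attempt 1 fails after one file; attempt 2 (different filenames) commits fully. */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("commit-sketch");
            Path dest = Files.createDirectory(root.resolve("dest"));
            Path attempt1 = writeAttempt(root, "attempt1", "a1-part-0", "a1-part-1");
            Path attempt2 = writeAttempt(root, "attempt2", "a2-part-0", "a2-part-1");
            try {
                commitFileByFile(attempt1, dest, 1);
            } catch (IllegalStateException expected) {
                // the first attempt died partway; one of its files is already in dest
            }
            commitFileByFile(attempt2, dest, -1);
            try (Stream<Path> out = Files.list(dest)) {
                return out.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [a1-part-0, a2-part-0, a2-part-1]
    }
}
```

The destination ends up holding a file from the failed attempt next to the 
complete output of the second one, which is the mixed result the description 
warns about.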


