[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

2021-03-23 Thread Bimalendu Choudhary (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307365#comment-17307365
 ] 

Bimalendu Choudhary commented on MAPREDUCE-7331:


The temporary files get deleted at the end of commitJob, when we get the 
pendingjobAttemptPath and simply delete that path, so anything inside it gets 
deleted. I don't think the underlying task attempt paths get deleted 
individually. So in my case it does not matter whether the other application 
had the same MapReduce jobId or not. Even if they share the same 
jobId/task attempt path, they will be writing to different partition 
directories inside it. 

To me it looks like one application finishes first and ends up deleting the 
whole _temporary directory. For now, the workaround we are trying out is 
configuring the job not to delete the _temporary directory at the end when we 
know that multiple Spark applications are using the same directory.

In my case we are running multiple Spark applications, each processing an 
individual partition of the same table, to make the processing fast. Since 
they all write to separate partitions, there is no chance of data 
interference. But we still end up getting a FileNotFound exception.
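
A minimal sketch of the failure mode described above, using only the local 
filesystem and java.nio.file rather than Hadoop or Spark. The partition names, 
the `0` attempt directory and all helper names are assumptions made for 
illustration; the point is only that a recursive delete of a shared 
`_temporary` directory by the first finisher removes the second application's 
pending file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Local-filesystem illustration only; it does not call any Hadoop or Spark API. */
final class SharedTempDirRace {

    /** Recursively deletes a directory tree, children before parents. */
    static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns true if application B can no longer find its pending file
     * after application A has committed and cleaned up the shared directory.
     */
    static boolean secondAppLosesItsFile() {
        try {
            Path table = Files.createTempDirectory("table");
            Path shared = table.resolve("_temporary");
            // Both applications stage output under the same _temporary directory,
            // each inside its own (hypothetical) partition directory.
            Path fileA = Files.createDirectories(shared.resolve("0/part=a")).resolve("part-00000");
            Path fileB = Files.createDirectories(shared.resolve("0/part=b")).resolve("part-00000");
            Files.write(fileA, new byte[] {1});
            Files.write(fileB, new byte[] {2});

            // Application A commits first: it moves its own file into place ...
            Files.createDirectories(table.resolve("part=a"));
            Files.move(fileA, table.resolve("part=a/part-00000"));
            // ... and then removes the whole shared directory, including B's file.
            deleteRecursively(shared);

            // Application B now tries to commit its own file.
            try {
                Files.createDirectories(table.resolve("part=b"));
                Files.move(fileB, table.resolve("part=b/part-00000"));
                return false;
            } catch (NoSuchFileException e) {
                return true;
            } finally {
                deleteRecursively(table);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second application lost its file: " + secondAppLosesItsFile());
    }
}
```

Note that the two applications never touch each other's partition directory; 
the loss comes purely from the cleanup of the common parent.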

 

 

 

> Make temporary directory used by FileOutputCommitter configurable
> -
>
> Key: MAPREDUCE-7331
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2
>Affects Versions: 3.0.0
> Environment: CDH 6.2.1 Hadoop 3.0.0
>Reporter: Bimalendu Choudhary
>Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their 
> files under a table directory. The hardcoded PENDING_DIR_NAME = _temporary 
> directory results in multiple applications using the same temporary 
> directory. This causes unwanted results, with one application interfering 
> with other applications' temporary files and one application ending up 
> deleting the temporary files of another. There is currently no way for 
> applications to have their own unique path for temporary files and so avoid 
> interference from other, totally independent applications. I think the 
> temporary directory used by FileOutputCommitter should be made configurable, 
> so that the caller can supply its own unique value as required and avoid its 
> files getting deleted or overwritten by other applications. 
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
>  public static final String PENDING_DIR_NAME =
>  "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>  
> This could be used very effectively by Spark applications, even to handle 
> stage failures where temporary directories from previous attempts cause 
> problems, and could help in many other situations. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

2021-03-23 Thread Bimalendu Choudhary (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307365#comment-17307365
 ] 

Bimalendu Choudhary edited comment on MAPREDUCE-7331 at 3/23/21, 7:16 PM:
--

The temporary files get deleted at the end of commitJob, when we get the 
pendingjobAttemptPath and simply delete that path, so anything inside it gets 
deleted. I don't think the underlying task attempt paths get deleted 
individually. So in my case it does not matter whether the other application 
had the same MapReduce jobId or not. Even if they share the same 
jobId/task attempt path, they will be writing to different partition 
directories inside it. 

To me it looks like one application finishes first and ends up deleting the 
whole _temporary directory. For now, the workaround we are trying out is 
configuring the job not to delete the _temporary directory at the end when we 
know that multiple Spark applications are using the same directory.

In my case we are running multiple Spark applications, each processing an 
individual partition of the same table, to make the processing fast. Since 
they all write to separate partitions, there is no chance of data 
interference. But we still end up getting a FileNotFound exception.

 

 

 









[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

2021-03-23 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307252#comment-17307252
 ] 

Steve Loughran commented on MAPREDUCE-7331:
---

Does the Spark version you have contain the fix "[SPARK-33402][CORE] Jobs 
launched in same second have duplicate MapReduce JobIDs"?

That may be the underlying problem: you have more than one stage reusing the 
same jobID, and so using the same job directory under _temporary.

Apply that fix first before going anywhere near FileOutputCommitter. We are 
scared of changes there as it is a critical part of so many applications. 
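
To illustrate why IDs derived from a launch time with one-second resolution 
collide, here is a small sketch. It is not Spark's code: the `jobId` helper 
and its timestamp pattern are assumptions chosen for illustration, with only 
the `job_<identifier>_<number>` shape taken from how MapReduce prints job IDs.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Illustration only: a job ID built from a second-granularity timestamp. */
final class JobIdCollision {
    // Anything below one second is discarded by this pattern.
    private static final DateTimeFormatter SECONDS =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss", Locale.ROOT);

    /** Hypothetical helper: formats an ID as job_<timestamp>_<4-digit sequence>. */
    static String jobId(LocalDateTime launchTime, int sequence) {
        return String.format(Locale.ROOT, "job_%s_%04d", SECONDS.format(launchTime), sequence);
    }

    public static void main(String[] args) {
        LocalDateTime first = LocalDateTime.of(2021, 3, 23, 19, 16, 5, 100_000_000);
        LocalDateTime second = first.plusNanos(800_000_000); // same wall-clock second
        System.out.println(jobId(first, 0));
        System.out.println(jobId(second, 0));
        System.out.println("collision: " + jobId(first, 0).equals(jobId(second, 0)));
    }
}
```

Two jobs that start within the same second and carry the same sequence number 
end up with the same ID, and anything keyed on that ID is shared between them.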







[jira] [Created] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

2021-03-23 Thread Bimalendu Choudhary (Jira)
Bimalendu Choudhary created MAPREDUCE-7331:
--

 Summary: Make temporary directory used by FileOutputCommitter 
configurable
 Key: MAPREDUCE-7331
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2
Affects Versions: 3.0.0
 Environment: CDH 6.2.1 Hadoop 3.0.0
Reporter: Bimalendu Choudhary


Spark SQL applications use FileOutputCommitter to commit and merge their files 
under a table directory. The hardcoded PENDING_DIR_NAME = _temporary directory 
results in multiple applications using the same temporary directory. This 
causes unwanted results, with one application interfering with other 
applications' temporary files and one application ending up deleting the 
temporary files of another. There is currently no way for applications to have 
their own unique path for temporary files and so avoid interference from 
other, totally independent applications. I think the temporary directory used 
by FileOutputCommitter should be made configurable, so that the caller can 
supply its own unique value as required and avoid its files getting deleted or 
overwritten by other applications. 

Something like:
{quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
 public static final String PENDING_DIR_NAME =
 "mapreduce.fileoutputcommitter.tempdir";
{quote}
 

This could be used very effectively by Spark applications, even to handle 
stage failures where temporary directories from previous attempts cause 
problems, and could help in many other situations. 
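
A rough sketch of how such a setting might be resolved, assuming a plain 
`Map` as a stand-in for Hadoop's `Configuration`. The property 
`mapreduce.fileoutputcommitter.tempdir` is only the name proposed in this 
issue, not an existing Hadoop option, and the constant name `PENDING_DIR_NAME` 
for the key, the validation rule and the helper methods are assumptions of 
this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the proposal; a Map stands in for Hadoop's Configuration. */
final class TempDirNameLookup {
    // Default directory name, as hardcoded today according to the issue text.
    static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
    // Proposed (not existing) property key; the constant name is an assumption.
    static final String PENDING_DIR_NAME = "mapreduce.fileoutputcommitter.tempdir";

    /** Returns the configured directory name, or the default when unset. */
    static String pendingDirName(Map<String, String> conf) {
        String name = conf.getOrDefault(PENDING_DIR_NAME, PENDING_DIR_NAME_DEFAULT);
        // Hypothetical safety check: keep the value a single path component.
        if (name.isEmpty() || name.contains("/")) {
            throw new IllegalArgumentException("invalid temporary directory name: " + name);
        }
        return name;
    }

    /** Builds the pending directory under the job's output path. */
    static String pendingJobDir(String outputPath, Map<String, String> conf) {
        return outputPath + "/" + pendingDirName(conf);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(pendingJobDir("/warehouse/table", conf));
        conf.put(PENDING_DIR_NAME, "_temporary_app1");
        System.out.println(pendingJobDir("/warehouse/table", conf));
    }
}
```

With a per-application value, two writers to the same table directory would 
stage under different parents, so one application's cleanup could not remove 
the other's pending files.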






[jira] [Commented] (MAPREDUCE-7322) revisiting TestMRIntermediateDataEncryption

2021-03-23 Thread Ahmed Hussein (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307115#comment-17307115
 ] 

Ahmed Hussein commented on MAPREDUCE-7322:
--

Thanks [~Jim_Brennan]!

> revisiting TestMRIntermediateDataEncryption 
> 
>
> Key: MAPREDUCE-7322
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7322
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: job submission, security, test
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: patch-available
> Fix For: 3.4.0, 3.1.5, 3.3.1, 2.10.2, 3.2.3
>
> Attachments: MAPREDUCE-7322.001.patch, MAPREDUCE-7322.002.patch, 
> MAPREDUCE-7322.003.patch, MAPREDUCE-7322.004.patch, MAPREDUCE-7322.005.patch, 
> MAPREDUCE-7322.006.patch, MAPREDUCE-7322.007.patch, MAPREDUCE-7322.008.patch, 
> MAPREDUCE-7322.009.patch, MAPREDUCE-7322.branch-2.10.009.patch, 
> MAPREDUCE-7322.branch-3.2.009.patch
>
>
> I was reviewing {{TestMRIntermediateDataEncryption}}. The unit test actually 
> has little to do with encryption.
> My conclusions are:
> * Enabling/Disabling {{MRJobConfig.MR_ENCRYPTED_INTERMEDIATE_DATA}} does not 
> change the behavior of the unit test.
> * There are no spill files generated by either mappers or reducers.
> * Wrapping I/O streams with Crypto never happens during the execution of the 
> unit test.
> Unless I misunderstand the purpose of that unit test, I suggest that it gets 
> re-implemented so that it validates encryption in spilled intermediate data.
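
As a rough idea of what "validating encryption" of spilled bytes could mean, 
the sketch below wraps an output stream with a cipher and checks that the 
stored bytes do not contain the plaintext record while still decrypting back 
to it. It uses only the JDK; AES in CTR mode, the all-zero key and IV, and 
every name here are assumptions chosen for the example and say nothing about 
how the Hadoop test is or should be written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** JDK-only illustration of checking that written bytes are really encrypted. */
final class SpillEncryptionCheck {

    private static Cipher cipher(int mode, byte[] key, byte[] iv) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c;
    }

    /** Writes the record through a cipher-wrapped stream and returns the stored bytes. */
    static byte[] writeEncrypted(byte[] plain, byte[] key, byte[] iv) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = new CipherOutputStream(sink, cipher(Cipher.ENCRYPT_MODE, key, iv))) {
                out.write(plain);
            }
            return sink.toByteArray();
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] decrypt(byte[] stored, byte[] key, byte[] iv) {
        try {
            return cipher(Cipher.DECRYPT_MODE, key, iv).doFinal(stored);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Naive byte-sequence search, sufficient for a small test payload. */
    static boolean contains(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(haystack, i, i + needle.length), needle)) {
                return true;
            }
        }
        return false;
    }

    /** True when the stored bytes hide the record but still decrypt back to it. */
    static boolean spillLooksEncrypted() {
        byte[] key = new byte[16]; // all-zero demo key: acceptable only inside a test
        byte[] iv = new byte[16];  // all-zero demo IV: acceptable only inside a test
        byte[] record = "intermediate-record-0001".getBytes(StandardCharsets.UTF_8);
        byte[] stored = writeEncrypted(record, key, iv);
        return !contains(stored, record) && Arrays.equals(decrypt(stored, key, iv), record);
    }

    public static void main(String[] args) {
        System.out.println("stored bytes look encrypted: " + spillLooksEncrypted());
    }
}
```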






[jira] [Work logged] (MAPREDUCE-7270) TestHistoryViewerPrinter could be failed when the locale isn't English.

2021-03-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7270?focusedWorklogId=570258=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-570258
 ]

ASF GitHub Bot logged work on MAPREDUCE-7270:
-

Author: ASF GitHub Bot
Created on: 23/Mar/21 06:23
Start Date: 23/Mar/21 06:23
Worklog Time Spent: 10m 
  Work Description: liuml07 commented on pull request #1942:
URL: https://github.com/apache/hadoop/pull/1942#issuecomment-804651653


   Thank you very much @ayushtkn. The report was not posted here, but the new 
run was fine.
   
   Since there is a checkstyle warning, could you rebase this PR on the `trunk` 
branch, and also fix the checkstyle warning, @sungpeo?
   
   > 
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestHistoryViewerPrinter.java:58:
  Locale.setDefault(DEFAULT_LOCALE);: 'method def' child has incorrect 
indentation level 6, expected level should be 4. [Indentation]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 570258)
Time Spent: 1h 20m  (was: 1h 10m)

> TestHistoryViewerPrinter could be failed when the locale isn't English.
> ---
>
> Key: MAPREDUCE-7270
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7270
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: test
>Reporter: Sungpeo Kook
>Assignee: Sungpeo Kook
>Priority: Minor
>  Labels: pull-request-available
> Attachments: MAPREDUCE-7270.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Both testHumanPrinter and testHumanPrinterAll assert against an expected 
> string.
> But the actual result includes a date format, which can differ depending 
> on the Locale.
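
The locale dependence can be reproduced with the JDK alone. The sketch below 
formats one fixed instant with a `SimpleDateFormat` that picks up the JVM 
default locale, then pins the default locale around the call in the spirit of 
the `Locale.setDefault(DEFAULT_LOCALE)` line quoted in the checkstyle warning. 
The date pattern and helper names are assumptions for illustration, not taken 
from the test.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.function.Supplier;

/** Shows date text changing with the default locale, and how to pin it. */
final class LocalePinnedDateFormat {
    // Fixed instant so the output does not depend on when the code runs.
    static final Date SAMPLE = Date.from(Instant.parse("2021-03-23T12:00:00Z"));

    /** Formats SAMPLE using whatever the JVM default locale is at call time. */
    static String formatWithDefaultLocale() {
        SimpleDateFormat fmt = new SimpleDateFormat("d-MMM-yyyy");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // remove time zone as a variable
        return fmt.format(SAMPLE);
    }

    /** Runs body with the default locale pinned, then restores the previous default. */
    static String withLocale(Locale locale, Supplier<String> body) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(locale);
        try {
            return body.get();
        } finally {
            Locale.setDefault(previous);
        }
    }

    public static void main(String[] args) {
        System.out.println(withLocale(Locale.ENGLISH, LocalePinnedDateFormat::formatWithDefaultLocale));
        System.out.println(withLocale(Locale.FRENCH, LocalePinnedDateFormat::formatWithDefaultLocale));
    }
}
```

An assertion against a hardcoded English string only holds in the first case, 
which is why a test with expected date text needs the locale fixed before it 
runs and restored afterwards.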


