[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable
[ https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307365#comment-17307365 ]

Bimalendu Choudhary commented on MAPREDUCE-7331:

The temporary files get deleted at the end of commitJob, when we get the pendingjobAttemptPath and simply delete that path, so anything inside it gets deleted. I don't think the underlying task attempt paths get deleted individually. So in my case it does not matter whether the other application had the same MapReduce jobId or not. Even if they share the same jobID/task attempt path, they will be writing to different partition directories inside it. To me it looks like one application finishes first and ends up deleting the whole _temporary directory.

For now, the workaround we are trying out is configuring the job not to delete the _temporary directory at the end when we know that multiple Spark applications are using the same directory. In my case we are running multiple Spark applications, each processing an individual partition of the same table, to make the processing fast. Since these are all separate partitions, there is no chance of data interference, but we end up getting a FileNotFound exception.

> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary
> directory results in multiple applications using the same temporary directory.
> This causes unwanted results, with one application interfering with another
> application's temporary files, and one application ending up deleting the
> temporary files of another. There is currently no way for applications to have
> their own unique path for storing temporary files, to avoid interference from
> other, totally independent applications. I think the temporary directory
> used by FileOutputCommitter should be made configurable, so that the
> caller can supply its own unique value as required and avoid
> having it deleted or overwritten by other applications.
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
> public static final String PENDING_DIR_NAME_DEFAULT =
> "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>
> This can be used very efficiently by Spark applications to handle even stage
> failures where temporary directories from previous attempts cause problems, and
> can help in many other situations.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
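To make the failure mode described in the comment above concrete, here is a minimal sketch in plain Java (java.nio only, no Hadoop classes). It is not FileOutputCommitter code: the class and method names, the table directory and the `0/part=a` layout are all made up for illustration. It only models two writers staging files under one shared `_temporary` directory, where the first writer to finish deletes the whole tree.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SharedTemporaryDemo {

    /** Recursively deletes a directory tree (children before parents). */
    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /** Returns true if writer B hits a missing-file error after writer A cleans up. */
    static boolean secondWriterFails() {
        try {
            Path table = Files.createTempDirectory("table");
            Path pending = table.resolve("_temporary");

            // Both writers stage output under the same pending directory,
            // each in its own (hypothetical) partition subdirectory.
            Path partA = Files.createDirectories(pending.resolve("0/part=a"));
            Path partB = Files.createDirectories(pending.resolve("0/part=b"));
            Files.write(partA.resolve("file-a"), new byte[] {1});

            // Writer A finishes first and removes the whole pending directory.
            deleteRecursively(pending);

            boolean failed;
            try {
                // Writer B is still running and tries to write its own file.
                Files.write(partB.resolve("file-b"), new byte[] {2});
                failed = false;
            } catch (NoSuchFileException e) {
                failed = true;
            }
            deleteRecursively(table);
            return failed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second writer failed: " + secondWriterFails());
    }
}
```

In this model the second writer fails even though the two writers never touch each other's partition directory, which is consistent with the FileNotFound symptom reported above.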
[jira] [Comment Edited] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable
[ https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307365#comment-17307365 ]

Bimalendu Choudhary edited comment on MAPREDUCE-7331 at 3/23/21, 7:16 PM:

The temporary files get deleted at the end of commitJob, when we get the pendingjobAttemptPath and simply delete that path, so anything inside it gets deleted. I don't think the underlying task attempt paths get deleted individually. So in my case it does not matter whether the other application had the same MapReduce jobId or not. Even if they share the same jobID/task attempt path, they will be writing to different partition directories inside it. To me it looks like one application finishes first and ends up deleting the whole _temporary directory.

For now, the workaround we are trying out is configuring the job not to delete the _temporary directory at the end when we know that multiple Spark applications are using the same directory. In my case we are running multiple Spark applications, each processing an individual partition of the same table, to make the processing fast. Since these are all separate partitions, there is no chance of data interference, but we end up getting a FileNotFound exception.
[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable
[ https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307252#comment-17307252 ]

Steve Loughran commented on MAPREDUCE-7331:

Does the Spark version you have contain the fix "[SPARK-33402][CORE] Jobs launched in same second have duplicate MapReduce JobIDs"? That may be the underlying problem: you have more than one stage reusing the same job ID, so they are using the same job directory under _temporary.

Apply that fix first before worrying about going anywhere near FileOutputCommitter. We are scared of changes there, as it is a critical part of so many applications.
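A minimal sketch of how jobs launched within the same second could end up with the same ID. It assumes the ID is derived from the launch time formatted at one-second granularity, which is what the title of SPARK-33402 suggests; the class name, method name and format pattern below are illustrative assumptions, not code copied from Spark.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class JobIdCollisionDemo {

    /** Hypothetical ID scheme: the launch time truncated to whole seconds. */
    static String jobTrackerId(long launchMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(launchMillis));
    }

    public static void main(String[] args) {
        long base = 1_616_526_960_000L; // an arbitrary instant on a whole second

        String first = jobTrackerId(base);         // launched at the start of the second
        String second = jobTrackerId(base + 999);  // launched 999 ms later, same second
        String third = jobTrackerId(base + 1000);  // launched in the next second

        System.out.println("first == second: " + first.equals(second));
        System.out.println("first == third:  " + first.equals(third));
    }
}
```

Under this assumption, any directory or file name built from the ID would coincide for the first two jobs, so checking for the fix is a cheaper first step than changing the committer.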
[jira] [Created] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable
Bimalendu Choudhary created MAPREDUCE-7331:
----------------------------------------------

             Summary: Make temporary directory used by FileOutputCommitter configurable
                 Key: MAPREDUCE-7331
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 3.0.0
         Environment: CDH 6.2.1 Hadoop 3.0.0
            Reporter: Bimalendu Choudhary

Spark SQL applications use FileOutputCommitter to commit and merge their files under a table directory. The hardcoded PENDING_DIR_NAME = _temporary directory results in multiple applications using the same temporary directory. This causes unwanted results, with one application interfering with another application's temporary files, and one application ending up deleting the temporary files of another. There is currently no way for applications to have their own unique path for storing temporary files, to avoid interference from other, totally independent applications.

I think the temporary directory used by FileOutputCommitter should be made configurable, so that the caller can supply its own unique value as required and avoid having it deleted or overwritten by other applications.

Something like:
{quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
public static final String PENDING_DIR_NAME_DEFAULT = "mapreduce.fileoutputcommitter.tempdir";
{quote}

This can be used very efficiently by Spark applications to handle even stage failures where temporary directories from previous attempts cause problems, and can help in many other situations.
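A rough sketch of how the proposed lookup might behave, using `java.util.Properties` as a stand-in for a Hadoop configuration object so that it runs without Hadoop on the classpath. The two declarations in the proposal share one name and would not compile as written, so this sketch assumes the second was meant to be the configuration key and gives it the hypothetical name `PENDING_DIR_NAME_KEY`. The helper methods and the `/warehouse/t` path are also illustrative only; none of this exists in FileOutputCommitter.

```java
import java.util.Properties;

public class PendingDirConfigDemo {

    // Default value: the directory name that is hardcoded today.
    static final String PENDING_DIR_NAME_DEFAULT = "_temporary";

    // Hypothetical constant name; the key string itself is taken from the proposal.
    static final String PENDING_DIR_NAME_KEY = "mapreduce.fileoutputcommitter.tempdir";

    /** Resolves the pending directory name, falling back to the default. */
    static String pendingDirName(Properties conf) {
        return conf.getProperty(PENDING_DIR_NAME_KEY, PENDING_DIR_NAME_DEFAULT);
    }

    /** Builds the staging path under the output directory (illustrative only). */
    static String pendingPath(String outputDir, Properties conf) {
        return outputDir + "/" + pendingDirName(conf);
    }

    public static void main(String[] args) {
        // An application that sets its own unique staging directory name.
        Properties appA = new Properties();
        appA.setProperty(PENDING_DIR_NAME_KEY, "_temporary_appA");

        // An application that sets nothing keeps the current behaviour.
        Properties unset = new Properties();

        System.out.println(pendingPath("/warehouse/t", appA));
        System.out.println(pendingPath("/warehouse/t", unset));
    }
}
```

With a default in place, existing jobs would keep writing to `_temporary`, and only callers that opt in would get a separate staging directory.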
[jira] [Commented] (MAPREDUCE-7322) revisiting TestMRIntermediateDataEncryption
[ https://issues.apache.org/jira/browse/MAPREDUCE-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307115#comment-17307115 ]

Ahmed Hussein commented on MAPREDUCE-7322:

Thanks [~Jim_Brennan]!

> revisiting TestMRIntermediateDataEncryption
> -------------------------------------------
>
>                 Key: MAPREDUCE-7322
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7322
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: job submission, security, test
>            Reporter: Ahmed Hussein
>            Assignee: Ahmed Hussein
>            Priority: Major
>              Labels: patch-available
>             Fix For: 3.4.0, 3.1.5, 3.3.1, 2.10.2, 3.2.3
>
>         Attachments: MAPREDUCE-7322.001.patch, MAPREDUCE-7322.002.patch,
> MAPREDUCE-7322.003.patch, MAPREDUCE-7322.004.patch, MAPREDUCE-7322.005.patch,
> MAPREDUCE-7322.006.patch, MAPREDUCE-7322.007.patch, MAPREDUCE-7322.008.patch,
> MAPREDUCE-7322.009.patch, MAPREDUCE-7322.branch-2.10.009.patch,
> MAPREDUCE-7322.branch-3.2.009.patch
>
>
> I was reviewing {{TestMRIntermediateDataEncryption}}. The unit test has
> actually little to do with encryption.
> I have the following conclusion:
> * Enabling/Disabling {{MRJobConfig.MR_ENCRYPTED_INTERMEDIATE_DATA}} does not
> change the behavior of the unit test.
> * There are no spill files generated by either mappers/reducers
> * Wrapping I/O streams with Crypto never happens during the execution of the
> unit test.
> Unless I misunderstand the purpose of that unit test, I suggest that it gets
> re-implemented so that it validates encryption in spilled intermediate data.
[jira] [Work logged] (MAPREDUCE-7270) TestHistoryViewerPrinter could be failed when the locale isn't English.
[ https://issues.apache.org/jira/browse/MAPREDUCE-7270?focusedWorklogId=570258=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-570258 ]

ASF GitHub Bot logged work on MAPREDUCE-7270:

                Author: ASF GitHub Bot
            Created on: 23/Mar/21 06:23
            Start Date: 23/Mar/21 06:23
    Worklog Time Spent: 10m
      Work Description: liuml07 commented on pull request #1942:
URL: https://github.com/apache/hadoop/pull/1942#issuecomment-804651653

   Thank you very much @ayushtkn. The report was not posted here, but the new run was fine.

   Since there is a checkstyle warning, could you rebase this PR onto the `trunk` branch and also fix the checkstyle warning, @sungpeo?

   > ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/jobhistory/TestHistoryViewerPrinter.java:58: Locale.setDefault(DEFAULT_LOCALE);: 'method def' child has incorrect indentation level 6, expected level should be 4. [Indentation]

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 570258)
    Time Spent: 1h 20m  (was: 1h 10m)

> TestHistoryViewerPrinter could be failed when the locale isn't English.
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7270
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: test
>            Reporter: Sungpeo Kook
>            Assignee: Sungpeo Kook
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: MAPREDUCE-7270.patch
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Both testHumanPrinter and testHumanPrinterAll have an expected string for
> their assertions, but the actual result includes a date format that can
> differ depending on the locale.
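A small sketch of the kind of locale dependence described in MAPREDUCE-7270, and of pinning the default locale around the code under test, as the `Locale.setDefault(DEFAULT_LOCALE)` call in the checkstyle message suggests the PR does. This is not the test from the PR; the class and method names are made up, and the month-name pattern is just an easy way to see the effect.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class LocaleDateDemo {

    /** Formats the month name using whatever the JVM default locale is. */
    static String monthName(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("MMMM"); // no Locale argument
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    /** Runs monthName with the default locale pinned, then restores it. */
    static String monthNameUnder(Locale pinned, long epochMillis) {
        Locale saved = Locale.getDefault();
        Locale.setDefault(pinned);
        try {
            return monthName(epochMillis);
        } finally {
            Locale.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        long epoch = 0L; // 1 January 1970, UTC
        System.out.println("default locale: " + monthName(epoch));
        System.out.println("pinned English: " + monthNameUnder(Locale.ENGLISH, epoch));
        System.out.println("pinned French:  " + monthNameUnder(Locale.FRENCH, epoch));
    }
}
```

An assertion against a hardcoded English string only holds when the formatter happens to run under an English locale, so pinning the locale in test setup and restoring it afterwards keeps the expected string stable.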