[jira] [Comment Edited] (MAPREDUCE-5890) Support for encrypting Intermediate data and spills in local filesystem

2019-02-12 Thread Gopi Krishnan Nambiar (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765567#comment-16765567
 ] 

Gopi Krishnan Nambiar edited comment on MAPREDUCE-5890 at 2/12/19 7:29 PM:
---

Hello [~vinodkv], [~chris.douglas], [~tucu00], [~asuresh],

 

I had a question about why this snippet of code, which was added as part of 
this JIRA (MAPREDUCE-5890) in JobSubmitter.java, was removed:
 {{int keyLen = CryptoUtils.isShuffleEncrypted(conf) ? conf.getInt(MRJobConfig.MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS, MRJobConfig.DEFAULT_MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS) : SHUFFLE_KEY_LENGTH;}}
  
 and later reverted and replaced with a constant value:
 {{keyGen.init(SHUFFLE_KEY_LENGTH);}}
 as part of this change: 
[https://github.com/apache/hadoop/commit/d9d7bbd99b533da5ca570deb3b8dc8a959c6b4db]
  
 Some context around this question: we are pursuing FedRAMP High 
certification, which mandates a key length of at least 112 bits for HMAC-SHA1, 
while the current key length is 64 bits. It would be great to know your 
thoughts on this one.


was (Author: gkrishnan):
Hello [~vinodkv], [~chris.douglas], [~tucu00], [~asuresh],

 

Had a question around why this snippet of code was removed (which was added as 
part of this JIRA - MAPREDUCE-5890):
{{int keyLen = CryptoUtils.isShuffleEncrypted(conf)}}? 
conf.getInt(MRJobConfig.MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS, 
MRJobConfig.DEFAULT_MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS): 
SHUFFLE_KEY_LENGTH;
 
and later reverted and replaced with a constant value:
{{keyGen.init(SHUFFLE_KEY_LENGTH);}}
as part of this 
change:[https://github.com/apache/hadoop/commit/d9d7bbd99b533da5ca570deb3b8dc8a959c6b4db]
 
Some context around this question: We are trying to go for FedRamp High 
Certification and that mandates a key length for HMAC-SHA1 to be at least 112 
bits and the current key length is 64 bits. Would be great to know your 
thoughts on this one.
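For illustration, a minimal sketch of the configurable-key-length pattern the removed snippet implemented, using a plain Map in place of Hadoop's Configuration. The property name and the 128-bit default here are assumptions for illustration, not necessarily the actual MRJobConfig values:

```java
import java.util.HashMap;
import java.util.Map;

public class ShuffleKeyLengthSketch {
    // Hypothetical stand-ins for the MRJobConfig constants discussed above.
    static final String KEY_SIZE_BITS_PROP =
            "mapreduce.job.encrypted-intermediate-data-key-size-bits";
    static final int DEFAULT_KEY_SIZE_BITS = 128; // assumed default
    static final int SHUFFLE_KEY_LENGTH = 64;     // the hard-coded fallback

    /** Mirrors the removed ternary: honour the configured key size only when
     *  intermediate-data encryption is enabled, else use the 64-bit constant. */
    static int keyLen(Map<String, Integer> conf, boolean shuffleEncrypted) {
        return shuffleEncrypted
                ? conf.getOrDefault(KEY_SIZE_BITS_PROP, DEFAULT_KEY_SIZE_BITS)
                : SHUFFLE_KEY_LENGTH;
    }

    public static void main(String[] args) {
        Map<String, Integer> conf = new HashMap<>();
        conf.put(KEY_SIZE_BITS_PROP, 192);
        System.out.println(keyLen(conf, true));  // prints 192
        System.out.println(keyLen(conf, false)); // prints 64
    }
}
```

Restoring a lookup of this shape, rather than the constant {{keyGen.init(SHUFFLE_KEY_LENGTH)}}, would let deployments raise the HMAC key size to the 112+ bits FedRAMP requires without a code change.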

> Support for encrypting Intermediate data and spills in local filesystem
> ---
>
> Key: MAPREDUCE-5890
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5890
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: security
>Affects Versions: 2.4.0
>Reporter: Alejandro Abdelnur
>Assignee: Arun Suresh
>Priority: Major
>  Labels: encryption
> Fix For: 2.6.0, fs-encryption
>
> Attachments: MAPREDUCE-5890.10.patch, MAPREDUCE-5890.11.patch, 
> MAPREDUCE-5890.12.patch, MAPREDUCE-5890.13.patch, MAPREDUCE-5890.14.patch, 
> MAPREDUCE-5890.15.patch, MAPREDUCE-5890.3.patch, MAPREDUCE-5890.4.patch, 
> MAPREDUCE-5890.5.patch, MAPREDUCE-5890.6.patch, MAPREDUCE-5890.7.patch, 
> MAPREDUCE-5890.8.patch, MAPREDUCE-5890.9.patch, 
> org.apache.hadoop.mapred.TestMRIntermediateDataEncryption-output.txt, 
> syslog.tar.gz
>
>
> For some sensitive data, encryption while in flight (network) is not 
> sufficient, it is required that while at rest it should be encrypted. 
> HADOOP-10150 & HDFS-6134 bring encryption at rest for data in filesystem 
> using Hadoop FileSystem API. MapReduce intermediate data and spills should 
> also be encrypted while at rest.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-7184:
--
Description: 
TestJobCounters test cases are failing in trunk while validating the input 
file sizes against the job's BYTES_READ counter. The crc files are counted in 
getFileSize, whereas the job's FileInputFormat -ignores them.- has stopped 
reading the CRC counters.



  was:
TestJobCounters test cases are failing in trunk while validating the input 
files size with BYTES_READ by the job. The crc files are considered in 
getFileSize whereas the job FileInputFormat ignores them.




> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat -ignores them.- has stopped 
> reading the CRC counters






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766254#comment-16766254
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

Created HADOOP-16107; this is (a) fairly fundamental and (b) has existed for a 
while on various create()/createNonRecursive() calls ... just nobody had ever 
noticed.

The fix will include tests for all of this.

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766184#comment-16766184
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

FWIW I am going to fix this, because it's (a) broadly fundamental and (b), with 
HADOOP-14397, has been lurking for a while.

The createFile part of the fix will need to be backported to 3.2 & earlier, 
because it means that even if you have a checksummed FS, a file created 
through the new createFile() builder wasn't going to be checksummed. Guess 
we are lucky nobody has been using this much (yet).

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Updated] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-7184:
--
Status: Open  (was: Patch Available)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Assigned] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reassigned MAPREDUCE-7184:
-

Assignee: Steve Loughran

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766178#comment-16766178
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

Update: I understand the issue now, and have tests to replicate the problem 
and a fix both for this and for the same issue affecting the createFile 
builder of HADOOP-14394.

Essentially: the create/open file builders need to refer back to the checksum 
FS create/open methods to ensure that checksumming on create and read takes 
place.

This little test found the problem. Thank you :)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Assigned] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned MAPREDUCE-7184:


Assignee: (was: Prabhu Joseph)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766091#comment-16766091
 ] 

Prabhu Joseph commented on MAPREDUCE-7184:
--

Thanks [~ste...@apache.org] for the details. I am OK either way. 

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766054#comment-16766054
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

* If we look at the changes here, the openFile() command set things up to use 
a CompletableFuture<> for the opening, which, by default, is actually 
evaluated in the same thread as the caller (i.e. it's a blocking operation).
* But if the counter is not the same, it means that the 
getRawFileSystem().open("file.crc").readFully() isn't incrementing the 
thread-local stats, which implies that it is somehow running in a different 
thread.

Will that have adverse consequences? No, but it is a difference in behaviour, 
and that could be considered a regression. And I don't understand why it is 
happening, given that the open call (see {{FileSystem.openFileWithOptions()}}) 
is executed in the same thread as normal.

Thoughts:
* Although the S3 Select stuff through the MR pipeline is going to have to go 
in later (MAPREDUCE-7182), I'd like to keep the openFile() code in as is, 
because it lets us add custom options to files opened (specifically, I want to 
add an option to allow the seek format of a file to be declared).

But: we could pull those changes into the MR code as is, with a goal of 
MAPREDUCE-7182 to add that stuff, including tests comparing the byte count 
options? Or: I can do something isolated just for here?
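The thread-local hypothesis above can be demonstrated without Hadoop at all. This standalone sketch (all names are illustrative) shows that work chained onto an already-completed CompletableFuture runs on the caller's thread and so updates the caller's thread-local counter, while work dispatched with runAsync() lands in another thread's copy:

```java
import java.util.concurrent.CompletableFuture;

public class ThreadLocalStatsSketch {
    // Stand-in for a filesystem's per-thread statistics counter.
    static final ThreadLocal<long[]> BYTES_READ =
            ThreadLocal.withInitial(() -> new long[1]);

    static void read() { BYTES_READ.get()[0] += 42; } // pretend readFully()

    public static void main(String[] args) {
        // Chained onto an already-completed future: executes on the calling
        // thread, so the caller's thread-local counter is incremented.
        CompletableFuture.completedFuture(null)
                .thenRun(ThreadLocalStatsSketch::read).join();
        System.out.println(BYTES_READ.get()[0]); // prints 42

        // Dispatched to another thread: the increment lands in that thread's
        // copy of the counter; the caller's copy is unchanged.
        CompletableFuture.runAsync(ThreadLocalStatsSketch::read).join();
        System.out.println(BYTES_READ.get()[0]); // still prints 42
    }
}
```

If the CRC read were somehow moved onto an executor thread, the second pattern is exactly how the caller's BYTES_READ would silently stop growing.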

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Updated] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-7184:
--
Summary: TestJobCounters byte counters omitting crc file bytes read  (was: 
TestJobCounters#getFileSize can ignore crc file)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765994#comment-16765994
 ] 

Prabhu Joseph commented on MAPREDUCE-7184:
--

[~tasanuma0829] FileInputFormat does not consider hidden files 
(HiddenFileFilter) in the input list, so I suspect the crc files are getting 
created after some change, after which the test case breaks. Since 
FileInputFormat does not consider crc files, getFileSize can also ignore them.
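For reference, the hidden-file filtering mentioned above boils down to a name predicate like the following (a simplified restatement; the real filter in FileInputFormat works on Path objects rather than strings):

```java
public class HiddenFileFilterSketch {
    /** Simplified restatement of FileInputFormat's hidden-file filter:
     *  a path whose final name component starts with '_' or '.' is
     *  excluded from the job's input list. */
    static boolean accepted(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(accepted("part-00000"));      // prints true
        System.out.println(accepted(".part-00000.crc")); // prints false
        System.out.println(accepted("_SUCCESS"));        // prints false
    }
}
```

Since ".part-00000.crc" starts with '.', the job never reads it as input, which is why getFileSize counting crc bytes makes the test's expected BYTES_READ too large.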

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765988#comment-16765988
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

That's odd, and a possible sign of a greater problem: if this regresses, so 
can other things. I don't see why CRC reads should suddenly not be counted.

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7185) Parallelize part files move in FileOutputCommitter

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765979#comment-16765979
 ] 

Steve Loughran commented on MAPREDUCE-7185:
---

Igor, which cloud infra was this? S3 has more fundamental issues than rename 
time, which is why we have the [zero rename 
committer|https://github.com/steveloughran/zero-rename-committer/releases] 
there. (I'm assuming GCS here, right?)

At the same time, yes, rename cost is high, especially on a store where the 
time to rename scales with the amount of data.

We're only just starting to play with futures in the production source of 
Hadoop, where the fact that IOEs have to be caught and wrapped is a 
fundamental problem with the Java language.

# Can you use CompletableFuture<> over the simpler Future<>? It chains better.
# See org.apache.hadoop.util.LambdaUtils for some support.
# And org.apache.hadoop.fs.impl.FutureIOSupport for work on wrapping and 
unwrapping IOEs, with WrappedException being exclusively for IOEs, so it is 
straightforward to unwrap back to an IOE.
# The default number of threads should be 1, so on HDFS the behaviour is the 
same. Cloud deployments with a consistent store can do more.
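A hedged sketch of the pattern the numbered points describe: fan the renames out as CompletableFutures, then unwrap the completion exception back to the original IOException rather than losing it in a generic wrapper. This uses plain java.nio in place of the Hadoop FileSystem API, and all class and method names here are made up for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class ParallelRenameSketch {

    static CompletableFuture<Void> renameAsync(Path src, Path dst, Executor pool) {
        return CompletableFuture.runAsync(() -> {
            try {
                Files.move(src, dst); // stand-in for FileSystem.rename()
            } catch (IOException e) {
                throw new UncheckedIOException(e); // wrap once, unwrap at join()
            }
        }, pool);
    }

    /** Rename all src/dst pairs in parallel, rethrowing the original IOE. */
    static void renameAll(List<Path[]> pairs, int threads) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<CompletableFuture<Void>> futures = pairs.stream()
                    .map(p -> renameAsync(p[0], p[1], pool))
                    .collect(Collectors.toList());
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } catch (CompletionException e) {
            // Surface the underlying IOException, not a generic wrapper.
            if (e.getCause() instanceof UncheckedIOException) {
                throw ((UncheckedIOException) e.getCause()).getCause();
            }
            throw e;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit");
        Path src = Files.createFile(dir.resolve("part-00000"));
        List<Path[]> pairs = new ArrayList<>();
        pairs.add(new Path[]{src, dir.resolve("final-00000")});
        renameAll(pairs, 2);
        System.out.println(Files.exists(dir.resolve("final-00000"))); // prints true
    }
}
```

With threads = 1 this degenerates to the sequential behaviour, matching the "default of 1 on HDFS" suggestion above.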


Regarding the patch: -1 as is.

* IOEs must be extracted from the ExecutionException, rather than wrapped in a 
generic "IOException" which loses the underlying failure code (making it hard 
for callers to interpret).
* We're going to change failure modes such that more than one rename may fail 
simultaneously.

As Yetus says, "The patch doesn't appear to include any new or modified 
tests." We'll need those, including some to help test failures.


Test-wise:
* What happens if the thread count is negative? It should be OK.
* What if the file length is 0? In that case, there isn't anything to rename 
at all. Again, my read of the code implies that's fine.
* You could see about adding something for S3A and AWS; the S3A tests could be 
skipped when S3Guard is disabled.
* A subclassable committer test in mapreduce-examples or mapreduce-client is 
going to be needed here: something for both the v1 and v2 algorithms.
* You effectively get that with HADOOP-16058: Terasort, which can be run 
against cloud infras. One for ABFS with a thread count > 1 would be nice.
* In org.apache.hadoop.fs.s3a.commit.staging.TestStagingCommitter we've got 
tests which use a mock FS to simulate and validate state. That could be used 
for this purpose too (i.e. add a new test for the classic committer, now with 
threads > 1); check it works, simulate failures.

Minor details:
* Check your import ordering.
* Add some javadocs for the new options.
* And something in the MR docs + mapred-default.xml. Secret settings benefit 
nobody except those who read the code line-by-line.

Having done the work in HADOOP-13786, the code in that base 
FileOutputCommitter scares me, and the way the two commit algorithms are 
intermingled in something co-recursive is part of the reason. It's why I added 
a whole new plugin point for the object store committers.

h3. If you are going to go near that FileOutputCommitter, I'm going to want to 
see as rigorous a proof of correctness as you can come up with.

V2 commit with >1 task writing to the same paths is the key risk point: task A 
writes to /dest/_temp/_job1_att_1/task_a/data1 but task B writes 
/dest/_temp/_job1_att_1/task_b/data1/file2; so the path to commit, data1, is 
both a file from task A and a dir from task B. Things have to fail in 
meaningful ways there, and a generic "IOException" doesn't qualify.

That zero-rename-committer doc is a best effort there, and its definition of 
"correctness" is the one I'll be reviewing this patch on. I think you could 
take that and cover this parallel-rename algorithm.

*important* This isn't me trying to dissuade you. I agree, this would be great 
for object stores with consistent listings and O(1) file renames, and guess 
what: the v2 algorithm effectively works a file at a time too. It's just that 
having this algorithm work correctly is critical to *everything* generating 
correct output.

Two extra points:

# We do now have a plugin point to slot in new commit algorithms underneath 
any FileOutputFormat which doesn't subclass getOutputCommitter(); you do have 
the option of adding a whole new committer for your store, which I will worry 
slightly less about. If any change proposed to FileOutputCommitter downgrades 
the normal HDFS output algorithm in any way (including loss of exception info 
on failures), I'm going to say "do it in its own committer".

# Having done the committer work, I think the v2 commit algorithm doesn't work 
properly: it handles failures badly, and in particular can't cope with partial 
failure of a committer during the abort phase, an outcome most executors, 
including Spark, aren't prepared for. I don't expect this one to be any better 
here, and with the parallelisation you can argue that the failure window again 
grows.

[jira] [Assigned] (MAPREDUCE-7185) Parallelize part files move in FileOutputCommitter

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reassigned MAPREDUCE-7185:
-

Assignee: Igor Dvorzhak

> Parallelize part files move in FileOutputCommitter
> --
>
> Key: MAPREDUCE-7185
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7185
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 2.9.2
>Reporter: Igor Dvorzhak
>Assignee: Igor Dvorzhak
>Priority: Major
> Attachments: MAPREDUCE-7185.patch
>
>
> If map task outputs multiple files it could be slow to move them from temp 
> directory to output directory in object stores (GCS, S3, etc).
> To improve performance we need to parallelize move of more than 1 file in 
> FileOutputCommitter.
> Repro:
>  Start spark-shell:
> {code}
> spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf 
> spark.dynamicAllocation.maxExecutors=2
> {code}
> From spark-shell:
> {code}
> val df = (1 to 1).toList.toDF("value").withColumn("p", $"value" % 
> 10).repartition(50)
> df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path"
>  -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")
> {code}
> With the fix execution time reduces from 130 seconds to 50 seconds.






[jira] [Comment Edited] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Takanobu Asanuma (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765950#comment-16765950
 ] 

Takanobu Asanuma edited comment on MAPREDUCE-7184 at 2/12/19 11:33 AM:
---

[~Prabhu Joseph] Thanks for reporting the issue and uploading the patch.

The errors occur after HADOOP-15229 was merged. Looks like the BYTES_READ 
counter counted the length of the crc files until then. I'm not sure whether 
{{TestJobCounters#getFileSize}} can ignore the crc files.

[~ste...@apache.org] Could you take a look?


was (Author: tasanuma0829):
[~Prabhu Joseph] Thanks for reporting the issue and uploading the patch.

The errors occur after HADOOP-15229 is merged. Looks like the BYTES_READ 
counter considered length of crc files until then. So I'm not sure the patch is 
right.

[~ste...@apache.org] Could you take a look?

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Takanobu Asanuma (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765950#comment-16765950
 ] 

Takanobu Asanuma commented on MAPREDUCE-7184:
-

[~Prabhu Joseph] Thanks for reporting the issue and uploading the patch.

The errors occur after HADOOP-15229 was merged. Looks like the BYTES_READ 
counter counted the length of the crc files until then. So I'm not sure the 
patch is right.

[~ste...@apache.org] Could you take a look?

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7099) Daily test result fails in MapReduce JobClient though there isn't any error

2019-02-12 Thread Takanobu Asanuma (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765947#comment-16765947
 ] 

Takanobu Asanuma commented on MAPREDUCE-7099:
-

[~ste...@apache.org] The errors in your last comment are related to another 
issue, MAPREDUCE-7184. Let's fix it first.

> Daily test result fails in MapReduce JobClient though there isn't any error
> ---
>
> Key: MAPREDUCE-7099
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7099
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: build, test
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Critical
>
> Looks like the test result in MapReduce JobClient always fails lately. Please 
> see the results of hadoop-qbt-trunk-java8-linux-x86:
>  
> [https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/]/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient.txt
> {noformat}
> [INFO] Results:
> [INFO] 
> [WARNING] Tests run: 565, Failures: 0, Errors: 0, Skipped: 10
> [INFO] 
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 02:06 h
> [INFO] Finished at: 2018-05-30T12:32:39+00:00
> [INFO] Final Memory: 25M/645M
> [INFO] 
> 
> [WARNING] The requested profile "parallel-tests" could not be activated 
> because it does not exist.
> [WARNING] The requested profile "shelltest" could not be activated because it 
> does not exist.
> [WARNING] The requested profile "native" could not be activated because it 
> does not exist.
> [WARNING] The requested profile "yarn-ui" could not be activated because it 
> does not exist.
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.21.0:test (default-test) on 
> project hadoop-mapreduce-client-jobclient: There was a timeout or other error 
> in the fork -> [Help 1]
> {noformat}


