[jira] [Comment Edited] (MAPREDUCE-5890) Support for encrypting Intermediate data and spills in local filesystem

2019-02-12 Thread Gopi Krishnan Nambiar (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765567#comment-16765567
 ] 

Gopi Krishnan Nambiar edited comment on MAPREDUCE-5890 at 2/12/19 7:29 PM:
---

Hello [~vinodkv], [~chris.douglas], [~tucu00], [~asuresh],

 

I had a question about why this snippet of code, which was added as part of 
this JIRA (MAPREDUCE-5890) in JobSubmitter.java, was removed:
 {{int keyLen = CryptoUtils.isShuffleEncrypted(conf) ? conf.getInt(MRJobConfig.MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS, MRJobConfig.DEFAULT_MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS) : SHUFFLE_KEY_LENGTH;}}
  
 and later reverted and replaced with a constant value:
 {{keyGen.init(SHUFFLE_KEY_LENGTH);}}
 as part of this change: 
[https://github.com/apache/hadoop/commit/d9d7bbd99b533da5ca570deb3b8dc8a959c6b4db]
  
 Some context around this question: we are pursuing FedRAMP High 
certification, which mandates a key length of at least 112 bits for HMAC-SHA1, 
while the current key length is 64 bits. It would be great to know your 
thoughts on this one.


was (Author: gkrishnan):
Hello [~vinodkv], [~chris.douglas], [~tucu00], [~asuresh],

 

Had a question around why this snippet of code was removed (which was added as 
part of this JIRA - MAPREDUCE-5890):
{{int keyLen = CryptoUtils.isShuffleEncrypted(conf)}}? 
conf.getInt(MRJobConfig.MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS, 
MRJobConfig.DEFAULT_MR_ENCRYPTED_INTERMEDIATE_DATA_KEY_SIZE_BITS): 
SHUFFLE_KEY_LENGTH;
 
and later reverted and replaced with a constant value:
{{keyGen.init(SHUFFLE_KEY_LENGTH);}}
as part of this 
change:[https://github.com/apache/hadoop/commit/d9d7bbd99b533da5ca570deb3b8dc8a959c6b4db]
 
Some context around this question: We are trying to go for FedRamp High 
Certification and that mandates a key length for HMAC-SHA1 to be at least 112 
bits and the current key length is 64 bits. Would be great to know your 
thoughts on this one.
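For illustration, a minimal sketch of the configurable-key-length pattern the removed snippet implemented, using a plain Map in place of Hadoop's Configuration. The property name and the 128-bit default here are assumptions for illustration, not necessarily the actual MRJobConfig values:

```java
import java.util.HashMap;
import java.util.Map;

public class ShuffleKeyLengthSketch {
    // Hypothetical stand-ins for the MRJobConfig constants discussed above.
    static final String KEY_SIZE_BITS_PROP =
            "mapreduce.job.encrypted-intermediate-data-key-size-bits";
    static final int DEFAULT_KEY_SIZE_BITS = 128; // assumed default
    static final int SHUFFLE_KEY_LENGTH = 64;     // the hard-coded fallback

    /** Mirrors the removed ternary: honour the configured key size only when
     *  intermediate-data encryption is enabled, else use the 64-bit constant. */
    static int keyLen(Map<String, Integer> conf, boolean shuffleEncrypted) {
        return shuffleEncrypted
                ? conf.getOrDefault(KEY_SIZE_BITS_PROP, DEFAULT_KEY_SIZE_BITS)
                : SHUFFLE_KEY_LENGTH;
    }

    public static void main(String[] args) {
        Map<String, Integer> conf = new HashMap<>();
        conf.put(KEY_SIZE_BITS_PROP, 192);
        System.out.println(keyLen(conf, true));  // prints 192
        System.out.println(keyLen(conf, false)); // prints 64
    }
}
```

Restoring a lookup of this shape, rather than the constant {{keyGen.init(SHUFFLE_KEY_LENGTH)}}, would let deployments raise the HMAC key size to the 112+ bits FedRAMP requires without a code change.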

> Support for encrypting Intermediate data and spills in local filesystem
> ---
>
> Key: MAPREDUCE-5890
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5890
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: security
>Affects Versions: 2.4.0
>Reporter: Alejandro Abdelnur
>Assignee: Arun Suresh
>Priority: Major
>  Labels: encryption
> Fix For: 2.6.0, fs-encryption
>
> Attachments: MAPREDUCE-5890.10.patch, MAPREDUCE-5890.11.patch, 
> MAPREDUCE-5890.12.patch, MAPREDUCE-5890.13.patch, MAPREDUCE-5890.14.patch, 
> MAPREDUCE-5890.15.patch, MAPREDUCE-5890.3.patch, MAPREDUCE-5890.4.patch, 
> MAPREDUCE-5890.5.patch, MAPREDUCE-5890.6.patch, MAPREDUCE-5890.7.patch, 
> MAPREDUCE-5890.8.patch, MAPREDUCE-5890.9.patch, 
> org.apache.hadoop.mapred.TestMRIntermediateDataEncryption-output.txt, 
> syslog.tar.gz
>
>
> For some sensitive data, encryption while in flight (network) is not 
> sufficient, it is required that while at rest it should be encrypted. 
> HADOOP-10150 & HDFS-6134 bring encryption at rest for data in filesystem 
> using Hadoop FileSystem API. MapReduce intermediate data and spills should 
> also be encrypted while at rest.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-7184:
--
Description: 
TestJobCounters test cases are failing in trunk while validating the input 
file sizes against the job's BYTES_READ counter. The crc files are counted in 
getFileSize, whereas the job's FileInputFormat -ignores them.- has stopped 
reading the CRC counters.



  was:
TestJobCounters test cases are failing in trunk while validating the input 
files size with BYTES_READ by the job. The crc files are considered in 
getFileSize whereas the job FileInputFormat ignores them.




> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat -ignores them.- has stopped 
> reading the CRC counters






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766254#comment-16766254
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

Created HADOOP-16107; this is (a) fairly fundamental and (b) has existed for a 
while on various create()/createNonRecursive() calls ... just nobody had ever 
noticed.

The fix will include tests for all of this.

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766184#comment-16766184
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

FWIW I am going to fix this, because it's (a) broadly fundamental and (b), with 
HADOOP-14397, has been lurking for a while.

The createFile part of the fix will need to be backported to 3.2 & earlier, 
because it means that even if you have a checksummed FS, a file created 
through the new createFile() builder wasn't going to be checksummed. Guess 
we are lucky nobody has been using this much (yet).

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Updated] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-7184:
--
Status: Open  (was: Patch Available)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Assigned] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reassigned MAPREDUCE-7184:
-

Assignee: Steve Loughran

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Steve Loughran
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766178#comment-16766178
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

Update: I understand the issue now, and have tests to replicate the problem 
and a fix both for this and for the same issue affecting the createFile 
builder of HADOOP-14394.

Essentially: the create/open file builders need to refer back to the checksum 
FS create/open methods to ensure that checksumming on create and read takes 
place.

This little test found the problem. Thank you :)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Assigned] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned MAPREDUCE-7184:


Assignee: (was: Prabhu Joseph)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766091#comment-16766091
 ] 

Prabhu Joseph commented on MAPREDUCE-7184:
--

Thanks [~ste...@apache.org] for the details. I am OK either way. 

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766054#comment-16766054
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

* If we look at the changes here, the openFile() command set things up to use 
a CompletableFuture<> for the opening, which, by default, is actually 
evaluated in the same thread as the caller (i.e. it's a blocking operation).
* But if the counter is not the same, it means that the 
getRawFileSystem().open("file.crc").readFully() isn't incrementing the 
thread-local stats, which implies that it is somehow running in a different 
thread.

Will that have adverse consequences? No, but it is a difference in behaviour, 
and that could be considered a regression. And I don't understand why it is 
happening, given that the open call (see {{FileSystem.openFileWithOptions()}}) 
is executed in the same thread as normal.

Thoughts:
* Although the S3 Select stuff through the MR pipeline is going to have to go 
in later (MAPREDUCE-7182), I'd like to keep the openFile() code in as is, 
because it lets us add custom options to files opened (specifically, I want to 
add an option to allow the seek format of a file to be declared).

But: we could pull those changes into the MR code as is, with a goal of 
MAPREDUCE-7182 to add that stuff, including tests comparing the byte count 
options? Or: I can do something isolated just for here?
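The thread-local hypothesis above can be demonstrated without Hadoop at all. This standalone sketch (all names are illustrative) shows that work chained onto an already-completed CompletableFuture runs on the caller's thread and so updates the caller's thread-local counter, while work dispatched with runAsync() lands in another thread's copy:

```java
import java.util.concurrent.CompletableFuture;

public class ThreadLocalStatsSketch {
    // Stand-in for a filesystem's per-thread statistics counter.
    static final ThreadLocal<long[]> BYTES_READ =
            ThreadLocal.withInitial(() -> new long[1]);

    static void read() { BYTES_READ.get()[0] += 42; } // pretend readFully()

    public static void main(String[] args) {
        // Chained onto an already-completed future: executes on the calling
        // thread, so the caller's thread-local counter is incremented.
        CompletableFuture.completedFuture(null)
                .thenRun(ThreadLocalStatsSketch::read).join();
        System.out.println(BYTES_READ.get()[0]); // prints 42

        // Dispatched to another thread: the increment lands in that thread's
        // copy of the counter; the caller's copy is unchanged.
        CompletableFuture.runAsync(ThreadLocalStatsSketch::read).join();
        System.out.println(BYTES_READ.get()[0]); // still prints 42
    }
}
```

If the CRC read were somehow moved onto an executor thread, the second pattern is exactly how the caller's BYTES_READ would silently stop growing.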

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Updated] (MAPREDUCE-7184) TestJobCounters byte counters omitting crc file bytes read

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated MAPREDUCE-7184:
--
Summary: TestJobCounters byte counters omitting crc file bytes read  (was: 
TestJobCounters#getFileSize can ignore crc file)

> TestJobCounters byte counters omitting crc file bytes read
> --
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765994#comment-16765994
 ] 

Prabhu Joseph commented on MAPREDUCE-7184:
--

[~tasanuma0829] FileInputFormat does not consider hidden files 
(HiddenFileFilter) in the input list, so I suspect the crc files are getting 
created after some change, after which the test case breaks. Since 
FileInputFormat does not consider crc files, getFileSize can also ignore them.
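For reference, the hidden-file filtering mentioned above boils down to a name predicate like the following (a simplified restatement; the real filter in FileInputFormat works on Path objects rather than strings):

```java
public class HiddenFileFilterSketch {
    /** Simplified restatement of FileInputFormat's hidden-file filter:
     *  a path whose final name component starts with '_' or '.' is
     *  excluded from the job's input list. */
    static boolean accepted(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(accepted("part-00000"));      // prints true
        System.out.println(accepted(".part-00000.crc")); // prints false
        System.out.println(accepted("_SUCCESS"));        // prints false
    }
}
```

Since ".part-00000.crc" starts with '.', the job never reads it as input, which is why getFileSize counting crc bytes makes the test's expected BYTES_READ too large.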

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765988#comment-16765988
 ] 

Steve Loughran commented on MAPREDUCE-7184:
---

That's odd, and a possible sign of a greater problem: if this regresses, so 
can other things. I don't see why CRC reads should suddenly not be counted.

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7185) Parallelize part files move in FileOutputCommitter

2019-02-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765979#comment-16765979
 ] 

Steve Loughran commented on MAPREDUCE-7185:
---

Igor, which cloud infra was this? S3 has more fundamental issues than rename 
time, which is why we have the [zero rename 
committer|https://github.com/steveloughran/zero-rename-committer/releases] 
there. (I'm assuming GCS here, right?)

At the same time, yes, rename cost is high, especially on a store where the 
time to rename scales with the amount of data.

We're only just starting to play with futures in the production source of 
Hadoop, where the fact that IOEs have to be caught and wrapped is a 
fundamental problem with the Java language.

# Can you use CompletableFuture<> over the simpler Future<>? It chains better.
# See org.apache.hadoop.util.LambdaUtils for some support.
# And org.apache.hadoop.fs.impl.FutureIOSupport for work on wrapping and 
unwrapping IOEs, with WrappedException being exclusively for IOEs, so it is 
straightforward to unwrap back to an IOE.
# The default number of threads should be 1, so on HDFS the behaviour is the 
same. Cloud deployments with a consistent store can do more.
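A hedged sketch of the pattern the numbered points describe: fan the renames out as CompletableFutures, then unwrap the completion exception back to the original IOException rather than losing it in a generic wrapper. This uses plain java.nio in place of the Hadoop FileSystem API, and all class and method names here are made up for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class ParallelRenameSketch {

    static CompletableFuture<Void> renameAsync(Path src, Path dst, Executor pool) {
        return CompletableFuture.runAsync(() -> {
            try {
                Files.move(src, dst); // stand-in for FileSystem.rename()
            } catch (IOException e) {
                throw new UncheckedIOException(e); // wrap once, unwrap at join()
            }
        }, pool);
    }

    /** Rename all src/dst pairs in parallel, rethrowing the original IOE. */
    static void renameAll(List<Path[]> pairs, int threads) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<CompletableFuture<Void>> futures = pairs.stream()
                    .map(p -> renameAsync(p[0], p[1], pool))
                    .collect(Collectors.toList());
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } catch (CompletionException e) {
            // Surface the underlying IOException, not a generic wrapper.
            if (e.getCause() instanceof UncheckedIOException) {
                throw ((UncheckedIOException) e.getCause()).getCause();
            }
            throw e;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit");
        Path src = Files.createFile(dir.resolve("part-00000"));
        List<Path[]> pairs = new ArrayList<>();
        pairs.add(new Path[]{src, dir.resolve("final-00000")});
        renameAll(pairs, 2);
        System.out.println(Files.exists(dir.resolve("final-00000"))); // prints true
    }
}
```

With threads = 1 this degenerates to the sequential behaviour, matching the "default of 1 on HDFS" suggestion above.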


Regarding the patch: -1 as is.

* IOEs must be extracted from the ExecutionException, rather than wrapped in a 
generic "IOException" which loses the underlying failure code (making it hard 
for callers to interpret).
* We're going to change failure modes such that more than one rename may fail 
simultaneously.

As Yetus says, "The patch doesn't appear to include any new or modified 
tests." We'll need those, including some to help test failures.


Test-wise:
* What happens if the thread count is negative? It should be OK.
* What if the file length is 0? In that case, there isn't anything to rename 
at all. Again, my read of the code implies that's fine.
* You could see about adding something for S3A and AWS; the S3A tests could be 
skipped when S3Guard is disabled.
* A subclassable committer test in mapreduce-examples or mapreduce-client is 
going to be needed here: something for both the v1 and v2 algorithms.
* You effectively get that with HADOOP-16058: Terasort, which can be run 
against cloud infras. One for ABFS with a thread count > 1 would be nice.
* In org.apache.hadoop.fs.s3a.commit.staging.TestStagingCommitter we've got 
tests which use a mock FS to simulate and validate state. That could be used 
for this purpose too (i.e. add a new test for the classic committer, now with 
threads > 1); check it works, simulate failures.

Minor details:
* Check your import ordering.
* Add some javadocs for the new options.
* And something in the MR docs + mapred-default.xml. Secret settings benefit 
nobody except those who read the code line-by-line.

Having done the work in HADOOP-13786, the code in that base 
FileOutputCommitter scares me, and the way the two commit algorithms are 
intermingled in something co-recursive is part of the reason. It's why I added 
a whole new plugin point for the object store committers.

h3. If you are going to go near that FileOutputCommitter, I'm going to want to 
see as rigorous a proof of correctness as you can come up with.

V2 commit with >1 task writing to the same paths is the key risk point: task A 
writes to /dest/_temp/_job1_att_1/task_a/data1 but task B writes 
/dest/_temp/_job1_att_1/task_b/data1/file2; so the path to commit, data1, is 
both a file from task A and a dir from task B. Things have to fail in 
meaningful ways there, and a generic "IOException" doesn't qualify.

That zero-rename-committer doc is a best effort there, and its definition of 
"correctness" is the one I'll be reviewing this patch on. I think you could 
take that and cover this parallel-rename algorithm.

*important* This isn't me trying to dissuade you. I agree, this would be great 
for object stores with consistent listings and O(1) file renames, and guess 
what: the v2 algorithm effectively works a file at a time too. It's just that 
having this algorithm work correctly is critical to *everything* generating 
correct output.

Two extra points:

# We do now have a plugin point to slot in new commit algorithms underneath 
any FileOutputFormat which doesn't subclass getOutputCommitter(); you do have 
the option of adding a whole new committer for your store, which I will worry 
slightly less about. If any change proposed to FileOutputCommitter downgrades 
the normal HDFS output algorithm in any way (including loss of exception info 
on failures), I'm going to say "do it in its own committer".

# Having done the committer work, I think the v2 commit algorithm doesn't work 
properly: it handles failures badly, and in particular can't cope with partial 
failure of a committer during the abort phase, an outcome most executors, 
including Spark, aren't prepared for. I don't expect this one to be any better 
here, and with the parallelisation you can argue that the failure window again 
grows.

[jira] [Assigned] (MAPREDUCE-7185) Parallelize part files move in FileOutputCommitter

2019-02-12 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reassigned MAPREDUCE-7185:
-

Assignee: Igor Dvorzhak

> Parallelize part files move in FileOutputCommitter
> --
>
> Key: MAPREDUCE-7185
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7185
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 2.9.2
>Reporter: Igor Dvorzhak
>Assignee: Igor Dvorzhak
>Priority: Major
> Attachments: MAPREDUCE-7185.patch
>
>
> If map task outputs multiple files it could be slow to move them from temp 
> directory to output directory in object stores (GCS, S3, etc).
> To improve performance we need to parallelize move of more than 1 file in 
> FileOutputCommitter.
> Repro:
>  Start spark-shell:
> {code}
> spark-shell --num-executors 2 --executor-memory 10G --executor-cores 4 --conf 
> spark.dynamicAllocation.maxExecutors=2
> {code}
> From spark-shell:
> {code}
> val df = (1 to 1).toList.toDF("value").withColumn("p", $"value" % 
> 10).repartition(50)
> df.write.partitionBy("p").mode("overwrite").format("parquet").options(Map("path"
>  -> s"gs://some/path")).saveAsTable("parquet_partitioned_bench")
> {code}
> With the fix execution time reduces from 130 seconds to 50 seconds.






[jira] [Comment Edited] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Takanobu Asanuma (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765950#comment-16765950
 ] 

Takanobu Asanuma edited comment on MAPREDUCE-7184 at 2/12/19 11:33 AM:
---

[~Prabhu Joseph] Thanks for reporting the issue and uploading the patch.

The errors occur after HADOOP-15229 was merged. Looks like the BYTES_READ 
counter counted the length of the crc files until then. I'm not sure whether 
{{TestJobCounters#getFileSize}} can ignore the crc files.

[~ste...@apache.org] Could you take a look?


was (Author: tasanuma0829):
[~Prabhu Joseph] Thanks for reporting the issue and uploading the patch.

The errors occur after HADOOP-15229 is merged. Looks like the BYTES_READ 
counter considered length of crc files until then. So I'm not sure the patch is 
right.

[~ste...@apache.org] Could you take a look?

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7184) TestJobCounters#getFileSize can ignore crc file

2019-02-12 Thread Takanobu Asanuma (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765950#comment-16765950
 ] 

Takanobu Asanuma commented on MAPREDUCE-7184:
-

[~Prabhu Joseph] Thanks for reporting the issue and uploading the patch.

The errors occur after HADOOP-15229 was merged. Looks like the BYTES_READ 
counter counted the length of the crc files until then. So I'm not sure the 
patch is right.

[~ste...@apache.org] Could you take a look?

> TestJobCounters#getFileSize can ignore crc file
> ---
>
> Key: MAPREDUCE-7184
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7184
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: MAPREDUCE-7184-001.patch, MAPREDUCE-7184-002.patch, 
> MAPREDUCE-7184-003.patch
>
>
> TestJobCounters test cases are failing in trunk while validating the input 
> files size with BYTES_READ by the job. The crc files are considered in 
> getFileSize whereas the job FileInputFormat ignores them.






[jira] [Commented] (MAPREDUCE-7099) Daily test result fails in MapReduce JobClient though there isn't any error

2019-02-12 Thread Takanobu Asanuma (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765947#comment-16765947
 ] 

Takanobu Asanuma commented on MAPREDUCE-7099:
-

[~ste...@apache.org] The errors in your last comment are related to another 
issue, MAPREDUCE-7184. Let's fix it first.

> Daily test result fails in MapReduce JobClient though there isn't any error
> ---
>
> Key: MAPREDUCE-7099
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7099
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: build, test
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Critical
>
> Looks like the test result in MapReduce JobClient always fails lately. Please 
> see the results of hadoop-qbt-trunk-java8-linux-x86:
>  
> [https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/]/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient.txt
> {noformat}
> [INFO] Results:
> [INFO] 
> [WARNING] Tests run: 565, Failures: 0, Errors: 0, Skipped: 10
> [INFO] 
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 02:06 h
> [INFO] Finished at: 2018-05-30T12:32:39+00:00
> [INFO] Final Memory: 25M/645M
> [INFO] 
> 
> [WARNING] The requested profile "parallel-tests" could not be activated 
> because it does not exist.
> [WARNING] The requested profile "shelltest" could not be activated because it 
> does not exist.
> [WARNING] The requested profile "native" could not be activated because it 
> does not exist.
> [WARNING] The requested profile "yarn-ui" could not be activated because it 
> does not exist.
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.21.0:test (default-test) on 
> project hadoop-mapreduce-client-jobclient: There was a timeout or other error 
> in the fork -> [Help 1]
> {noformat}


