[jira] [Updated] (MAPREDUCE-7059) Downward Compatibility issue: MR job fails because of unknown setErasureCodingPolicy method from 3.x client to HDFS 2.x cluster

2018-04-04 Thread Lei (Eddy) Xu (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lei (Eddy) Xu updated MAPREDUCE-7059:
-
Fix Version/s: (was: 3.0.2)
   3.0.3

> Downward Compatibility issue: MR job fails because of unknown 
> setErasureCodingPolicy method from 3.x client to HDFS 2.x cluster
> ---
>
> Key: MAPREDUCE-7059
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7059
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: job submission
>Affects Versions: 3.0.0
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Critical
> Fix For: 3.1.0, 3.0.3
>
> Attachments: MAPREDUCE-7059.001.patch, MAPREDUCE-7059.002.patch, 
> MAPREDUCE-7059.003.patch, MAPREDUCE-7059.004.patch, MAPREDUCE-7059.005.patch, 
> MAPREDUCE-7059.006.patch
>
>
> Running teragen with a Hadoop 3.1 client fails when the HDFS server is 2.8:
> {code:java}
> bin/hadoop jar 
> share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0-SNAPSHOT.jar  teragen  
> 10 /teragen
> {code}
> The failure occurs because HDFS 2.8 does not implement setErasureCodingPolicy.
> One possible solution is to catch the RemoteException in 
> JobResourceUploader#disableErasureCodingForPath, like this:
> {code:java}
> private void disableErasureCodingForPath(FileSystem fs, Path path)
>     throws IOException {
>   try {
>     if (fs instanceof DistributedFileSystem) {
>       LOG.info("Disabling Erasure Coding for path: " + path);
>       DistributedFileSystem dfs = (DistributedFileSystem) fs;
>       dfs.setErasureCodingPolicy(path,
>           SystemErasureCodingPolicies.getReplicationPolicy().getName());
>     }
>   } catch (RemoteException e) {
>     if (!e.getClassName().equals(RpcNoSuchMethodException.class.getName())) {
>       throw e;
>     } else {
>       LOG.warn("HDFS server does not support setErasureCodingPolicy;"
>           + " skipping disableErasureCodingForPath", e);
>     }
>   }
> }
> {code}
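Stripped of the Hadoop types, the exception filter above can be exercised in isolation. The two exception classes below are hypothetical stand-ins for org.apache.hadoop.ipc.RemoteException and RpcNoSuchMethodException, used only so the class-name check is runnable without a cluster:

```java
import java.io.IOException;

// Stand-in for org.apache.hadoop.ipc.RpcNoSuchMethodException (assumption).
class RpcNoSuchMethodException extends IOException {}

// Stand-in for org.apache.hadoop.ipc.RemoteException, which carries the
// server-side exception class name as a string (assumption).
class RemoteException extends IOException {
    private final String className;
    RemoteException(String className) { this.className = className; }
    String getClassName() { return className; }
}

public class ErasureFallbackSketch {
    // Mirrors the proposed fallback: swallow only "unknown RPC method"
    // errors from an older server, rethrow everything else.
    static boolean shouldIgnore(RemoteException e) {
        return e.getClassName().equals(RpcNoSuchMethodException.class.getName());
    }

    public static void main(String[] args) {
        if (!shouldIgnore(new RemoteException(RpcNoSuchMethodException.class.getName()))) {
            throw new AssertionError("missing-method error should be ignored");
        }
        if (shouldIgnore(new RemoteException("java.io.FileNotFoundException"))) {
            throw new AssertionError("other remote errors must be rethrown");
        }
        System.out.println("ok");
    }
}
```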
> Does anyone have a better solution?
> The detailed exception trace is:
> {code:java}
> 2018-02-26 11:22:53,178 INFO mapreduce.JobSubmitter: Cleaning up the staging 
> area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1518615699369_0006
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException):
>  Unknown method setErasureCodingPolicy called on 
> org.apache.hadoop.hdfs.protocol.ClientProtocol protocol.
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:436)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1437)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>   at com.sun.proxy.$Proxy11.setErasureCodingPolicy(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1583)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy12.setErasureCodingPolicy(Unknown Source)
>   at 
> 

[jira] [Commented] (MAPREDUCE-7069) Add ability to specify user environment variables individually

2018-04-04 Thread Jim Brennan (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425908#comment-16425908 ]

Jim Brennan commented on MAPREDUCE-7069:


[~jlowe] thanks for the thorough review.  Much appreciated!
{quote}It's a bit odd and inefficient that setVMEnv calls 
MRApps.setEnvFromInputProperty twice. I think it would be clearer and more 
efficient to call it once, place the results in a temporary map (like it 
already does in the second call), then only set HADOOP_ROOT_LOGGER and 
HADOOP_CLIENT_OPTS in the environment if they are not set in the temporary map. 
Then at the end we can simply call addAll to dump the contents of the temporary 
map into the environment map.
{quote}
The reason it was done with two calls is the way environment variables are 
handled when they are already defined in the environment map. If an environment 
variable being updated already exists in the environment, the setEnvFromInput* 
functions append the new value to the existing value, using the appropriate 
separator. The special handling for HADOOP_ROOT_LOGGER and HADOOP_CLIENT_OPTS 
is to overwrite them instead of appending.  That said, I can definitely change 
it to work the way you suggest, except I can't just use addAll() - we 
ultimately need to call Apps.addToEnvironment on each k/v pair. I could expose 
an Apps.setEnvFromInputStringMapNoExpand() (or add a noExpand boolean to the 
existing one) to handle this, though.
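A plain-Java sketch of the append-vs-overwrite behavior described above. addToEnv is an illustrative stand-in for Apps.addToEnvironment, not the real Hadoop API: normal variables are appended to any existing value with a separator, while designated keys like HADOOP_ROOT_LOGGER are overwritten outright.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EnvMergeSketch {
    // Hypothetical merge helper: append with a separator unless the caller
    // asks for an outright overwrite (as for HADOOP_ROOT_LOGGER).
    static void addToEnv(Map<String, String> env, String key, String value,
                         String sep, boolean overwrite) {
        String existing = env.get(key);
        if (existing == null || overwrite) {
            env.put(key, value);
        } else {
            env.put(key, existing + sep + value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("CLASSPATH", "/a");
        env.put("HADOOP_ROOT_LOGGER", "INFO,console");

        addToEnv(env, "CLASSPATH", "/b", ":", false);                     // appends
        addToEnv(env, "HADOOP_ROOT_LOGGER", "DEBUG,console", ":", true);  // overwrites

        if (!env.get("CLASSPATH").equals("/a:/b")) throw new AssertionError();
        if (!env.get("HADOOP_ROOT_LOGGER").equals("DEBUG,console")) throw new AssertionError();
        System.out.println("ok");
    }
}
```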

Thanks for the documentation/comment recommendations - I was going to ask about 
that - I'll clean those up.
{quote}Nit: setEnvFromInputStringMap does not need to be public.
{quote}
Will fix. In an earlier iteration I was calling this directly.
{quote}Would it be easier to call tmpEnv.addAll(inputMap) and pass tmpEnv 
instead of inputMap? Then we don't need to explicitly iterate the map.
{quote}
Yes.  I will make this change.

> Add ability to specify user environment variables individually
> --
>
> Key: MAPREDUCE-7069
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7069
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: MAPREDUCE-7069.001.patch, MAPREDUCE-7069.002.patch
>
>
> As reported in YARN-6830, it is currently not possible to specify an 
> environment variable that contains commas via {{mapreduce.map.env}}, 
> mapreduce.reduce.env, or {{mapreduce.admin.user.env}}.
> To address this, [~aw] proposed in [YARN-6830] that we add the ability to 
> specify environment variables individually:
> {quote}e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-7069) Add ability to specify user environment variables individually

2018-04-04 Thread Jason Lowe (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425770#comment-16425770 ]

Jason Lowe commented on MAPREDUCE-7069:
---

Thanks for the patch!

It's a bit odd and inefficient that setVMEnv calls 
MRApps.setEnvFromInputProperty twice.  I think it would be clearer and more 
efficient to call it once, place the results in a temporary map (like it 
already does in the second call), then only set HADOOP_ROOT_LOGGER and 
HADOOP_CLIENT_OPTS in the environment if they are not set in the temporary map. 
 Then at the end we can simply call addAll to dump the contents of the 
temporary map into the environment map.

The example documentation in JobConf is confusing.  It uses 
"MAPRED_MAP_TASK_ENV" and "MAPRED_REDUCE_TASK_ENV", but those literal strings 
should not appear in the property name.  It would be clearer to use 
"mapreduce.map.env" and "mapreduce.reduce.env" in the examples, or else to give 
the example in the Java realm with something like set(MAPRED_MAP_TASK_ENV 
+ ".varName", varValue) so it is clearly not a literal string in the property 
name.  My preference is the former.

The relevant property descriptions in mapred-default.xml should be updated to 
reflect the new functionality.

It would be good to update MapReduceTutorial.md to document the options for 
passing environment variables to tasks.

There are a number of comments in setEnvFromString that should be fixed up.  I 
realize this is mostly cut-and-paste from the old setEnvFromInputString, but 
since we're refactoring it would be nice to clean it up a bit in the process.  
There's no such thing as a tt (tasktracker) in YARN, and the comments imply 
this is only called by a nodemanager to set up the env for a child process, 
which is not always the case.  "note" s/b "not", etc.

For javadoc comments it's not necessary to state the type of the variable after 
the variable name.  Javadoc can automatically extract this from the method 
signature.

Nit: setEnvFromInputStringMap does not need to be public.

Would it be easier to call tmpEnv.addAll(inputMap) and pass tmpEnv instead of 
inputMap?  Then we don't need to explicitly iterate the map.

The unit test should add new properties with commas and/or equal signs in the 
values and verify that the values come through in the environment map.
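As a rough illustration of what such a test would check, here is a sketch of the proposed per-variable form: a property named "<base>.<VAR>" contributes VAR=value, so a comma in the value no longer collides with the comma-separated list syntax. The collect helper and its behavior are assumptions for illustration, not the actual MRApps implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerVarEnvSketch {
    // Hypothetical helper: gather every "<base>.<VAR>" property into an
    // env map keyed by VAR, leaving the value untouched.
    static Map<String, String> collect(Map<String, String> conf, String base) {
        Map<String, String> env = new LinkedHashMap<>();
        String prefix = base + ".";
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                env.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return env;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Commas and equal signs in the value survive intact.
        conf.put("mapreduce.map.env.JAVA_OPTS", "-Dx=1,-Dy=2");
        conf.put("mapreduce.map.env.MODE", "fast");

        Map<String, String> env = collect(conf, "mapreduce.map.env");
        if (!env.get("JAVA_OPTS").equals("-Dx=1,-Dy=2")) throw new AssertionError();
        if (!env.get("MODE").equals("fast")) throw new AssertionError();
        System.out.println("ok");
    }
}
```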

Does it make sense to split some of the unit test up into separate tests?  For 
example the null input test can easily stand by itself.  Separate tests make it 
easier to identify what's working and what's broken rather than a stacktrace 
with a line number in the middle of a large unit test that is testing many 
different aspects.


> Add ability to specify user environment variables individually
> --
>
> Key: MAPREDUCE-7069
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7069
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: MAPREDUCE-7069.001.patch, MAPREDUCE-7069.002.patch
>
>
> As reported in YARN-6830, it is currently not possible to specify an 
> environment variable that contains commas via {{mapreduce.map.env}}, 
> mapreduce.reduce.env, or {{mapreduce.admin.user.env}}.
> To address this, [~aw] proposed in [YARN-6830] that we add the ability to 
> specify environment variables individually:
> {quote}e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
> {quote}






[jira] [Commented] (MAPREDUCE-7072) mapred job -history prints duplicate counter in human output

2018-04-04 Thread Wilfred Spiegelenburg (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-7072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425564#comment-16425564 ]

Wilfred Spiegelenburg commented on MAPREDUCE-7072:
--

The root cause of the issue is in the {{AbstractCounters}} code, specifically 
{{getGroupNames()}}.

When you step through the code in the debugger, the number of counter groups 
returned is higher than expected, because the deprecated counter group names 
are added to the list of counter group names before it is returned. The display 
names of the counters tracked in the deprecated list (stored in the legacyMap) 
are the same as the display names of the non-deprecated counters, and the 
deprecated groups are already present in the non-deprecated list, which causes 
the duplication.
The JSON format works because it internally uses a HashMap keyed by counter 
group name: the keys clash, so the existing value is simply overwritten by the 
value from the deprecated group.

To trace where this came from: MAPREDUCE-4053 changed the iteration to work for 
Oozie, and it appears related to OOZIE-777 and HadoopELFunctions, which still 
seem to use the deprecated counter names.
Changing what the method returns is therefore not possible without breaking 
Oozie. Instead, we can use the iterator exposed by the abstract counters, as it 
does not include the deprecated names.
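A minimal stand-alone illustration of the behavior described above (the group names are examples, not taken from AbstractCounters): concatenating real and legacy group names yields duplicates in the human-readable listing, while keying by name, as the JSON path effectively does, collapses them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class CounterGroupSketch {
    public static void main(String[] args) {
        // What the human output iterates over: real groups plus legacy
        // aliases whose display names match the real ones.
        List<String> groups = new ArrayList<>();
        groups.add("Job Counters");
        groups.add("File System Counters");
        groups.add("Job Counters"); // legacy alias, same display name

        // Keying by name (as a HashMap does in the JSON path) removes
        // the duplicate; the human path would print it twice.
        LinkedHashSet<String> deduped = new LinkedHashSet<>(groups);
        if (groups.size() != 3) throw new AssertionError();
        if (deduped.size() != 2) throw new AssertionError();
        System.out.println("ok");
    }
}
```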

> mapred job -history prints duplicate counter in human output
> 
>
> Key: MAPREDUCE-7072
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7072
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.0.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> The 'mapred job -history' command prints duplicate counter entries, but only 
> in the human output format. It does not do this for the JSON format.
> mapred job -history /user/history/somefile.jhist -format human
> {code}
> 
> |Job Counters |Total megabyte-seconds taken by all map tasks|0 |0 |268,288,000
> ...
> |Job Counters |Total megabyte-seconds taken by all map tasks|0 |0 |268,288,000
> 
> {code}






[jira] [Created] (MAPREDUCE-7072) mapred job -history prints duplicate counter in human output

2018-04-04 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created MAPREDUCE-7072:


 Summary: mapred job -history prints duplicate counter in human 
output
 Key: MAPREDUCE-7072
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7072
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client
Affects Versions: 3.0.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The 'mapred job -history' command prints duplicate counter entries, but only 
in the human output format. It does not do this for the JSON format.

mapred job -history /user/history/somefile.jhist -format human
{code}

|Job Counters |Total megabyte-seconds taken by all map tasks|0 |0 |268,288,000
...
|Job Counters |Total megabyte-seconds taken by all map tasks|0 |0 |268,288,000

{code}


