[jira] [Created] (MAPREDUCE-6876) FileInputFormat.listStatus should not fetch delegation tokens

2017-04-13 Thread Michael Gummelt (JIRA)
Michael Gummelt created MAPREDUCE-6876:
--

 Summary: FileInputFormat.listStatus should not fetch delegation 
tokens
 Key: MAPREDUCE-6876
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6876
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Michael Gummelt


{{FileInputFormat.listStatus}} fetches delegation tokens: 
https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213

AFAICT, this is unnecessary.  {{listStatus}} doesn't pass those tokens on to 
another process.  It is also causing the issues described in the attached 
Spark Kerberos ticket, because {{TokenCache.obtainTokensForNamenodes}}, which 
is used to fetch the delegation tokens, assumes that certain MapReduce 
configuration variables are set, which isn't true in the Spark calling code. 
That is a separate problem, but it wouldn't have arisen if {{listStatus}} 
weren't fetching delegation tokens.
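
For reference, the call in question looks roughly like this (a minimal 
sketch of the linked trunk code, not an exact copy; surrounding details 
omitted):

// In org.apache.hadoop.mapred.FileInputFormat#listStatus (sketch)
Path[] dirs = getInputPaths(job);
// Fetches HDFS delegation tokens for every input path and stores them
// in the job's Credentials, even though listStatus itself never hands
// them to another process.
TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);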






Re: Skip bad records when streaming supported?

2017-04-13 Thread Daniel Templeton

To quote the docs:

---
This feature can be used when map/reduce tasks crash deterministically 
on certain input. This happens due to bugs in the map/reduce function. 
The usual course would be to fix these bugs. But sometimes this is not 
possible; perhaps the bug is in third-party libraries for which the 
source code is not available. Because of this, the task never reaches 
completion even with multiple attempts, and the complete data for that 
task is lost.


With this feature, only a small portion of data surrounding the bad 
record is lost, which may be acceptable for some user applications. See 
setMapperMaxSkipRecords(Configuration, long).

---

Basically, it's a heavy-handed approach that you should only use as a 
last resort.
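
If you do use it from the Java API, the knobs live in 
org.apache.hadoop.mapred.SkipBadRecords.  A minimal sketch (the values 
here are illustrative, not a recommendation):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipConfig {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Enter skipping mode after the first failed task attempt.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 1);
    // Tolerate losing at most one record around each bad map input.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
    // Likewise for reducer input groups.
    SkipBadRecords.setReducerMaxSkipGroups(conf, 1);
  }
}

These setters back the same mapreduce.task.skip.start.attempts, 
mapreduce.map.skip.maxrecords, and mapreduce.reduce.skip.maxgroups 
properties that appear on the streaming command line later in this 
thread.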


Daniel


On 4/13/17 3:24 PM, Pillis W wrote:

Thanks Daniel.

Please correct me if I have understood this incorrectly, but according 
to the documentation at 
http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records 
, it seems the sole purpose of this functionality is to tolerate 
unknown failures/exceptions in mappers/reducers. If I were able to 
catch all failures, I would not need this ability at all - is that 
not true?


If I have understood it incorrectly, when would one use the feature to 
skip bad records?


Regards,
PW




On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton wrote:


You have to modify wordcount-mapper-t1.py to just ignore the bad
line.  In the worst case, you should be able to do something like:

import sys

for line in sys.stdin:
  try:
    # Insert processing code here
    pass
  except Exception:
    # Error processing record, ignore it
    pass

Daniel


On 4/13/17 1:33 PM, Pillis W wrote:

Hello,
I am using 'hadoop-streaming.jar' to do a simple word count, and want to
skip records that fail execution. Below is the actual command I run, and
the mapper always fails on one record, and hence fails the job. The input
file is 3 lines with 1 bad line.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D mapred.job.name=SkipTest
-Dmapreduce.task.skip.start.attempts=1 -Dmapreduce.map.skip.maxrecords=1
-Dmapreduce.reduce.skip.maxgroups=1
-Dmapreduce.map.skip.proc.count.autoincr=false
-Dmapreduce.reduce.skip.proc.count.autoincr=false -D mapred.reduce.tasks=1
-D mapred.map.tasks=1 -files
/home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
-input /user/hadoop/data/test1 -output /user/hadoop/data/output-test-5
-mapper "python wordcount-mapper-t1.py" -reducer "python
wordcount-reducer-t1.py"


I was wondering if skipping of records is supported when MapReduce is used
in streaming mode?

Thanks in advance.
PW






Re: Skip bad records when streaming supported?

2017-04-13 Thread Pillis W
Thanks Daniel.

Please correct me if I have understood this incorrectly, but according to
the documentation at
http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records
, it seems the sole purpose of this functionality is to tolerate
unknown failures/exceptions in mappers/reducers. If I were able to catch
all failures, I would not need this ability at all - is that not true?

If I have understood it incorrectly, when would one use the feature to skip
bad records?

Regards,
PW




On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton wrote:

> You have to modify wordcount-mapper-t1.py to just ignore the bad line.  In
> the worst case, you should be able to do something like:
>
> import sys
>
> for line in sys.stdin:
>   try:
>     # Insert processing code here
>     pass
>   except Exception:
>     # Error processing record, ignore it
>     pass
>
> Daniel
>
>
> On 4/13/17 1:33 PM, Pillis W wrote:
>
>> Hello,
>> I am using 'hadoop-streaming.jar' to do a simple word count, and want to
>> skip records that fail execution. Below is the actual command I run, and
>> the mapper always fails on one record, and hence fails the job. The input
>> file is 3 lines with 1 bad line.
>>
>> hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D mapred.job.name=SkipTest
>> -Dmapreduce.task.skip.start.attempts=1 -Dmapreduce.map.skip.maxrecords=1
>> -Dmapreduce.reduce.skip.maxgroups=1
>> -Dmapreduce.map.skip.proc.count.autoincr=false
>> -Dmapreduce.reduce.skip.proc.count.autoincr=false -D
>> mapred.reduce.tasks=1
>> -D mapred.map.tasks=1 -files
>> /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
>> -input /user/hadoop/data/test1 -output /user/hadoop/data/output-test-5
>> -mapper "python wordcount-mapper-t1.py" -reducer "python
>> wordcount-reducer-t1.py"
>>
>>
>> I was wondering if skipping of records is supported when MapReduce is used
>> in streaming mode?
>>
>> Thanks in advance.
>> PW
>>
>>


Re: Skip bad records when streaming supported?

2017-04-13 Thread Daniel Templeton
You have to modify wordcount-mapper-t1.py to just ignore the bad line.  
In the worst case, you should be able to do something like:


import sys

for line in sys.stdin:
  try:
    # Insert processing code here
    pass
  except Exception:
    # Error processing record, ignore it
    pass

Daniel

On 4/13/17 1:33 PM, Pillis W wrote:

Hello,
I am using 'hadoop-streaming.jar' to do a simple word count, and want to
skip records that fail execution. Below is the actual command I run, and
the mapper always fails on one record, and hence fails the job. The input
file is 3 lines with 1 bad line.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D mapred.job.name=SkipTest
-Dmapreduce.task.skip.start.attempts=1 -Dmapreduce.map.skip.maxrecords=1
-Dmapreduce.reduce.skip.maxgroups=1
-Dmapreduce.map.skip.proc.count.autoincr=false
-Dmapreduce.reduce.skip.proc.count.autoincr=false -D mapred.reduce.tasks=1
-D mapred.map.tasks=1 -files
/home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
-input /user/hadoop/data/test1 -output /user/hadoop/data/output-test-5
-mapper "python wordcount-mapper-t1.py" -reducer "python
wordcount-reducer-t1.py"


I was wondering if skipping of records is supported when MapReduce is used
in streaming mode?

Thanks in advance.
PW







Skip bad records when streaming supported?

2017-04-13 Thread Pillis W
Hello,
I am using 'hadoop-streaming.jar' to do a simple word count, and want to
skip records that fail execution. Below is the actual command I run, and
the mapper always fails on one record, and hence fails the job. The input
file is 3 lines with 1 bad line.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar -D mapred.job.name=SkipTest
-Dmapreduce.task.skip.start.attempts=1 -Dmapreduce.map.skip.maxrecords=1
-Dmapreduce.reduce.skip.maxgroups=1
-Dmapreduce.map.skip.proc.count.autoincr=false
-Dmapreduce.reduce.skip.proc.count.autoincr=false -D mapred.reduce.tasks=1
-D mapred.map.tasks=1 -files
/home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py
-input /user/hadoop/data/test1 -output /user/hadoop/data/output-test-5
-mapper "python wordcount-mapper-t1.py" -reducer "python
wordcount-reducer-t1.py"


I was wondering if skipping of records is supported when MapReduce is used
in streaming mode?

Thanks in advance.
PW


Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86

2017-04-13 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/

[Apr 12, 2017 3:20:08 PM] (epayne) YARN-6450. TestContainerManagerWithLCE 
requires override for each new
[Apr 12, 2017 5:02:13 PM] (cnauroth) HADOOP-14248. Retire 
SharedInstanceProfileCredentialsProvider in trunk.
[Apr 12, 2017 6:17:31 PM] (templedf) HADOOP-14246. Authentication Tokens should 
use SecureRandom instead of
[Apr 12, 2017 6:29:24 PM] (kihwal) HDFS-11648. Lazy construct the IIP pathname. 
Contributed by Daryn Sharp.
[Apr 12, 2017 6:40:58 PM] (aengineer) HDFS-11645. DataXceiver thread should log 
the actual error when getting
[Apr 12, 2017 7:24:32 PM] (wang) HDFS-11565. Use compact identifiers for 
built-in ECPolicies in
[Apr 12, 2017 7:27:34 PM] (wang) HDFS-10996. Ability to specify per-file EC 
policy at create time.
[Apr 12, 2017 8:43:18 PM] (junping_du) YARN-3760. FSDataOutputStream leak in
[Apr 12, 2017 9:21:20 PM] (kasha) YARN-6432. FairScheduler: Reserve preempted 
resources for corresponding
[Apr 12, 2017 9:30:34 PM] (liuml07) HADOOP-14255. S3A to delete unnecessary 
fake directory objects in
[Apr 12, 2017 11:07:10 PM] (liuml07) HADOOP-14274. Azure: Simplify Ranger-WASB 
policy model. Contributed by




-1 overall


The following subsystems voted -1:
asflicense unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests :

   hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting 
   hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure 
   hadoop.hdfs.server.datanode.TestDirectoryScanner 
   hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits 
   hadoop.hdfs.TestReadStripedFileWithMissingBlocks 
   hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks 
   hadoop.hdfs.server.datanode.TestDataNodeUUID 
   hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService 
   hadoop.yarn.server.TestContainerManagerSecurity 
   hadoop.yarn.server.TestMiniYarnClusterNodeUtilization 
   hadoop.yarn.applications.distributedshell.TestDistributedShell 
   hadoop.mapred.TestMRTimelineEventHandling 
   hadoop.tools.TestDistCpSystem 
   hadoop.tools.TestHadoopArchiveLogsRunner 
   hadoop.metrics2.impl.TestKafkaMetrics 
  

   cc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/diff-compile-cc-root.txt
  [4.0K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/diff-compile-javac-root.txt
  [184K]

   checkstyle:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/diff-checkstyle-root.txt
  [17M]

   pylint:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/diff-patch-pylint.txt
  [20K]

   shellcheck:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/diff-patch-shellcheck.txt
  [20K]

   shelldocs:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/diff-patch-shelldocs.txt
  [12K]

   whitespace:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/whitespace-eol.txt
  [12M]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/whitespace-tabs.txt
  [1.2M]

   javadoc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/diff-javadoc-javadoc-root.txt
  [2.2M]

   unit:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
  [528K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
  [60K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-tests.txt
  [324K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-applications-distributedshell.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-jobclient.txt
  [88K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/patch-unit-hadoop-tools_hadoop-distcp.txt
  [20K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/375/artifact/out/patch-unit-hadoop-tools_hadoop-archive-logs.txt
  [8.0K]
   

Apache Hadoop qbt Report: trunk+JDK8 on Linux/ppc64le

2017-04-13 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/

[Apr 12, 2017 3:20:08 PM] (epayne) YARN-6450. TestContainerManagerWithLCE 
requires override for each new
[Apr 12, 2017 5:02:13 PM] (cnauroth) HADOOP-14248. Retire 
SharedInstanceProfileCredentialsProvider in trunk.
[Apr 12, 2017 6:17:31 PM] (templedf) HADOOP-14246. Authentication Tokens should 
use SecureRandom instead of
[Apr 12, 2017 6:29:24 PM] (kihwal) HDFS-11648. Lazy construct the IIP pathname. 
Contributed by Daryn Sharp.
[Apr 12, 2017 6:40:58 PM] (aengineer) HDFS-11645. DataXceiver thread should log 
the actual error when getting
[Apr 12, 2017 7:24:32 PM] (wang) HDFS-11565. Use compact identifiers for 
built-in ECPolicies in
[Apr 12, 2017 7:27:34 PM] (wang) HDFS-10996. Ability to specify per-file EC 
policy at create time.
[Apr 12, 2017 8:43:18 PM] (junping_du) YARN-3760. FSDataOutputStream leak in
[Apr 12, 2017 9:21:20 PM] (kasha) YARN-6432. FairScheduler: Reserve preempted 
resources for corresponding
[Apr 12, 2017 9:30:34 PM] (liuml07) HADOOP-14255. S3A to delete unnecessary 
fake directory objects in
[Apr 12, 2017 11:07:10 PM] (liuml07) HADOOP-14274. Azure: Simplify Ranger-WASB 
policy model. Contributed by




-1 overall


The following subsystems voted -1:
compile mvninstall unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc javac


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests :

   hadoop.hdfs.server.namenode.ha.TestPipelinesFailover 
   hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewer 
   hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting 
   hadoop.hdfs.TestRollingUpgrade 
   hadoop.hdfs.web.TestWebHdfsTimeouts 
   hadoop.mapreduce.v2.hs.TestHistoryServerLeveldbStateStoreService 
   hadoop.mapred.TestShuffleHandler 
   hadoop.tools.TestHadoopArchiveLogsRunner 
   hadoop.metrics2.impl.TestKafkaMetrics 
   hadoop.yarn.applications.distributedshell.TestDistributedShell 
   hadoop.yarn.server.timeline.TestRollingLevelDB 
   hadoop.yarn.server.timeline.TestTimelineDataManager 
   hadoop.yarn.server.timeline.TestLeveldbTimelineStore 
   hadoop.yarn.server.timeline.recovery.TestLeveldbTimelineStateStore 
   hadoop.yarn.server.timeline.TestRollingLevelDBTimelineStore 
   hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer 
   hadoop.yarn.server.resourcemanager.recovery.TestLeveldbRMStateStore 
   hadoop.yarn.server.resourcemanager.TestRMRestart 
   hadoop.yarn.server.resourcemanager.TestRMAdminService 
   hadoop.yarn.server.TestMiniYarnClusterNodeUtilization 
   hadoop.yarn.server.TestContainerManagerSecurity 
   hadoop.yarn.server.timeline.TestLevelDBCacheTimelineStore 
   hadoop.yarn.server.timeline.TestOverrideTimelineStoreYarnClient 
   hadoop.yarn.server.timeline.TestEntityGroupFSTimelineStore 

Timed out junit tests :

   org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCache 
  

   mvninstall:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-mvninstall-root.txt
  [492K]

   compile:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-compile-root.txt
  [20K]

   cc:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-compile-root.txt
  [20K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-compile-root.txt
  [20K]

   unit:

   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-unit-hadoop-assemblies.txt
  [4.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
  [496K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-hs.txt
  [16K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-nativetask.txt
  [40K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-shuffle.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-unit-hadoop-tools_hadoop-archive-logs.txt
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-ppc/287/artifact/out/patch-unit-hadoop-tools_hadoop-kafka.txt
  [8.0K]