Re: how to get info about which data in hdfs or file system that a MapReduce job visits?

2017-07-27 Thread Ravi Prakash
Hi Jaxon!

MapReduce is just one application among many (including Tez, Spark, Slider,
etc.) that run on YARN. Each YARN application decides what it wants to log.
For MapReduce,
https://github.com/apache/hadoop/blob/27a1a5fde94d4d7ea0ed172635c146d594413781/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java#L762
logs which input split is being processed. Are you not seeing this message?
Perhaps check the log level of MapTask.
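One place to check is the map-task log level. As a rough sketch (the jar
name and driver class here are hypothetical, and `mapreduce.map.log.level`
typically defaults to INFO already, so a missing message usually means
something has raised it):

```shell
# Hypothetical job submission that explicitly sets the map-task log level
# so MapTask's "Processing split: ..." line shows up in the task logs.
hadoop jar my-job.jar MyDriver \
  -D mapreduce.map.log.level=INFO \
  /input /output
```

The same property can also be set in mapred-site.xml if you want it
cluster-wide rather than per job.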

For other YARN applications, the logging may differ.

In any case, for all frameworks, if the file is on HDFS, the HDFS audit
log should have a record of the access.
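For example, a read typically shows up as a `cmd=open` entry whose `src=`
field is the file path. A minimal sketch of pulling that path out (the
sample line below is illustrative; the exact field layout can vary by
Hadoop version):

```shell
# A representative hdfs-audit.log entry for a file read (sample data):
line='2017-07-27 10:15:32,123 INFO FSNamesystem.audit: allowed=true ugi=jaxon (auth:SIMPLE) ip=/10.0.0.5 cmd=open src=/data/input/part-00000 dst=null perm=null proto=rpc'

# Keep only read operations and extract the src= path.
echo "$line" | grep 'cmd=open' | sed -n 's/.*src=\([^ ]*\).*/\1/p'
```

Running the same filter over the whole hdfs-audit.log, joined with the
ugi/ip/time fields you already extract, gives you which files a job read.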

HTH
Ravi



On Wed, Jul 26, 2017 at 11:27 PM, Jaxon Hu wrote:

> Hi!
>
> I was trying to implement a Hadoop/Spark audit tool, but I ran into a
> problem: I can’t get the input file location and file name. I can get the
> username, IP address, time, and user command from hdfs-audit.log. But when
> I submit a MapReduce job, I can’t see the input file location in either
> the Hadoop logs or the Hadoop ResourceManager. Does Hadoop have an API or
> log that records this info, perhaps via some configuration? If so, what
> should I configure?
>
> Thanks.
>


how to get info about which data in hdfs or file system that a MapReduce job visits?

2017-07-27 Thread Jaxon Hu
Hi!

I was trying to implement a Hadoop/Spark audit tool, but I ran into a
problem: I can’t get the input file location and file name. I can get the
username, IP address, time, and user command from hdfs-audit.log. But when
I submit a MapReduce job, I can’t see the input file location in either
the Hadoop logs or the Hadoop ResourceManager. Does Hadoop have an API or
log that records this info, perhaps via some configuration? If so, what
should I configure?

Thanks.