[jira] [Updated] (MAPREDUCE-6283) MRHistoryServer log files management optimization

2015-07-02 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated MAPREDUCE-6283:

Priority: Major  (was: Minor)

 MRHistoryServer log files management optimization
 -

 Key: MAPREDUCE-6283
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6283
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobhistoryserver
Reporter: Zhang Wei
Assignee: Varun Saxena
   Original Estimate: 2,016h
  Remaining Estimate: 2,016h

 In some heavy-computation clusters, users may continually submit large 
 numbers of jobs; in our scenario, there are 240k jobs per day. On average, 5 
 nodes participate in running each job, and all of these jobs' log files are 
 aggregated onto HDFS. That is a big load for the NameNode. The total number 
 of log files generated in the default cleaning period (1 week) can be 
 calculated as follows:
 AM logs per week: 7 days * 240,000 jobs/day * 2 files/job = 3,360,000 files
 App logs per week: 7 days * 240,000 jobs/day * 5 nodes/job * 1 file/node = 
 8,400,000 files
 More than 10 million log files are generated in one week. Even worse, some 
 environments have to keep the logs for a longer time to track potential 
 issues. In general, these small log files occupy about 12 GB of NameNode 
 heap and degrade the NameNode's response speed.
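
 As a quick sanity check, the arithmetic above can be reproduced directly (a 
 minimal sketch; the per-job constants are the ones quoted in this 
 description):
{code:java}
public class LogFileEstimate {
  public static void main(String[] args) {
    long jobsPerDay = 240_000L;
    int days = 7;            // default cleaning period: one week
    int amFilesPerJob = 2;   // .jhist + .xml kept per job in done-dir
    int nodesPerJob = 5;     // average number of nodes per job

    long amLogs = (long) days * jobsPerDay * amFilesPerJob;  // 3,360,000
    long appLogs = (long) days * jobsPerDay * nodesPerJob;   // 8,400,000
    System.out.printf("AM logs/week:  %,d%n", amLogs);
    System.out.printf("App logs/week: %,d%n", appLogs);
    System.out.printf("Total/week:    %,d%n", amLogs + appLogs);  // 11,760,000
  }
}
{code}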
 For optimizing the log management of the history server, the main goals are:
 1) Reduce the total count of files in HDFS.
 2) Stay compatible with the former history server operation.
 From the goals above, we can derive the detailed demands as follows:
 1) Merge log files into bigger ones in HDFS periodically.
 2) The optimized design should inherit from the original architecture so 
 that the merged logs can be browsed transparently.
 3) Merged logs should be aged periodically just like the common logs.
 The whole life cycle of the AM logs:
 1. Created by the Application Master in intermediate-done-dir.
 2. Moved to done-dir after the job is done.
 3. Archived to archived-dir periodically.
 4. Cleaned when all the logs in the har archive are expired.
 The whole life cycle of the App logs:
 1. Created by applications in local-dirs.
 2. Aggregated to remote-app-log-dir after the job is done.
 3. Archived to archived-dir periodically.
 4. Cleaned when all the logs in the har archive are expired.
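
 The directories named in these life cycles map onto standard Hadoop/YARN 
 settings; a minimal sketch of resolving them is below. The first three keys 
 are real configuration properties, while "mapreduce.jobhistory.archived-dir" 
 is a hypothetical name for the proposed archived-dir:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class LogDirs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Standard properties for the existing stages of the life cycles:
    String intermediateDone = conf.get("mapreduce.jobhistory.intermediate-done-dir");
    String done = conf.get("mapreduce.jobhistory.done-dir");
    String remoteAppLog = conf.get("yarn.nodemanager.remote-app-log-dir", "/tmp/logs");
    // Hypothetical key for the proposed archive stage (does not exist upstream):
    String archived = conf.get("mapreduce.jobhistory.archived-dir");

    System.out.println("AM logs:  " + intermediateDone + " -> " + done + " -> " + archived);
    System.out.println("App logs: local-dirs -> " + remoteAppLog + " -> " + archived);
  }
}
{code}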



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAPREDUCE-6283) MRHistoryServer log files management optimization

2015-06-03 Thread Zhang Wei (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Wei updated MAPREDUCE-6283:
-
Assignee: Varun Saxena  (was: Zhang Wei)



[jira] [Updated] (MAPREDUCE-6283) MRHistoryServer log files management optimization

2015-03-20 Thread Zhang Wei (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhang Wei updated MAPREDUCE-6283:
-
Description: 
In some heavy-computation clusters, there may be a potential HDFS small-files 
problem: continually submitted MR jobs create millions of log files 
(including the application master logs and the application logs that the 
NodeManager aggregates to HDFS). This optimization design helps reduce the 
number of log files by merging them into bigger ones.

1. Background
Running one MR job outputs 2 types of logs: the Application Master (AM) logs 
and the Application logs.
AM Logs
AM logs are generated by the MR Application Master and written directly to 
HDFS. They record the start time and running time of each job, plus the 
start time, running time, and Counter values of all tasks. These logs are 
managed and analyzed by the MR History Server and shown on the History 
Server job list page and job details page.
Each job generates three log files in intermediate-done-dir; a timed task 
then moves the “.jhist” and “.xml” files to the final done-dir and deletes 
the “.summary” file. So the total number of logs in HDFS is twice the job 
count.
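
A simplified sketch of that timed move (the real logic lives in the History 
Server's file-management code, and the real done-dir is bucketed by date; 
the selection below only mirrors the rule just described):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveIntermediateDone {
  /** Move finished .jhist/.xml files to done-dir and drop .summary files. */
  public static void moveToDone(Configuration conf, Path intermediateDoneDir,
      Path doneDir) throws Exception {
    FileSystem fs = intermediateDoneDir.getFileSystem(conf);
    for (FileStatus stat : fs.listStatus(intermediateDoneDir)) {
      String name = stat.getPath().getName();
      if (name.endsWith(".jhist") || name.endsWith(".xml")) {
        fs.rename(stat.getPath(), new Path(doneDir, name));  // keep history + conf
      } else if (name.endsWith(".summary")) {
        fs.delete(stat.getPath(), false);                    // summary is discarded
      }
    }
  }
}
{code}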
Application Logs
Application logs are generated by the applications that run in NodeManager 
containers. By default, application logs are stored only on local disks. 
With YARN's log aggregation feature enabled, these logs are aggregated to 
remote-app-log-dir in HDFS and deleted from the local disks. During 
aggregation, all the logs on the same host are merged into one log file 
named after the host name and NodeManager port. So the total number of 
application logs in HDFS equals the number of NodeManagers that participated 
in running the job.
2. Problem Scenario
Take a heavy-computation cluster as an example: in a cluster with hundreds 
of computing nodes, more than 240,000 MR jobs are submitted and run per day, 
and on average 5 nodes participate in running each job. The total number of 
log files generated in the default cleaning period (1 week) can be 
calculated as follows:
AM logs per week: 7 days * 240,000 jobs/day * 2 files/job = 3,360,000 files
App logs per week: 7 days * 240,000 jobs/day * 5 nodes/job * 1 file/node = 
8,400,000 files
More than 10 million log files are generated in one week. Even worse, some 
environments have to keep the logs for a longer time to track potential 
issues. In general, these small log files occupy about 12 GB of NameNode 
heap and degrade the NameNode's response speed.

3. Design Goals & Demands
For optimizing the log management of the history server, the main goals are:
1) Reduce the total count of files in HDFS.
2) Make the fewest source code changes needed to reach goal 1.
3) Stay compatible with the former history server operation.

From the goals above, we can derive the detailed demands as follows:
1) Merge log files into bigger ones in HDFS periodically.
2) The optimized design should inherit from the original architecture so 
that the merged logs can be browsed transparently.
3) Merged logs should be aged periodically just like the common logs.

4. Overall Design
Both AM logs and aggregated App logs are stored in HDFS, and the logs are 
read many times but never need editing. As a distributed file system, HDFS 
is not intended for storing a huge number of small files, and Hadoop 
Archives is one solution to the small-files problem. Periodically archiving 
the logs into large files and deleting the original small log files can 
rapidly reduce the file count. In the problem scenario above, supposing the 
archive task runs every 24h, fewer than one thousand files (blocks) would be 
added per day.
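
For illustration, one day of done-dir files could be packed with the stock 
Hadoop Archives tool, the programmatic equivalent of "hadoop archive 
-archiveName ... -p <parent> <src> <dest>". The directory layout below is 
illustrative, not something this design mandates:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.HadoopArchives;
import org.apache.hadoop.util.ToolRunner;

public class ArchiveDoneDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int rc = ToolRunner.run(conf, new HadoopArchives(conf), new String[] {
        "-archiveName", "logs-20150320.har",
        "-p", "/mr-history/done",   // parent of the source paths
        "2015/03/20",               // one day's worth of history files
        "/mr-history/archived"      // destination directory for the .har
    });
    System.exit(rc);  // the tool runs a MapReduce job to build the archive
  }
}
{code}
After a successful run, the original small files can be deleted, leaving one 
archive per day (its index files plus a few packed part files).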

Archive log files
The archive task is triggered by a timer every 24h by default 
(configurable). If there are more than 1000 (configurable) jobs' logs in the 
aggregated dir, the archive task archives them into large files and then 
deletes the original files.
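
A minimal sketch of such a timer with the backlog threshold is below; 
countAggregatedJobs() and archiveJobs() are hypothetical stand-ins for the 
directory listing and HAR-creation steps:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ArchiveScheduler {
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  /** Check every intervalHours; archive only when the backlog is large enough. */
  public void start(long intervalHours, int jobThreshold) {
    timer.scheduleAtFixedRate(() -> {
      if (countAggregatedJobs() > jobThreshold) {  // default threshold: 1000 jobs
        archiveJobs();                             // pack into a HAR, delete originals
      }
    }, intervalHours, intervalHours, TimeUnit.HOURS);
  }

  private int countAggregatedJobs() { return 0; }  // hypothetical: list aggregated dir
  private void archiveJobs() { }                   // hypothetical: HAR-creation step
}
{code}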

Browse archived logs
Files in Hadoop Archive format can be accessed transparently through the 
HDFS API, so archived log files can be read by the History Server 
compatibly, with only a little core code updating. Users can browse archived 
logs on the front page as before.
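
That transparency comes from the har:// scheme: an archive can be listed and 
read through the ordinary FileSystem API, which is what keeps the History 
Server changes small. A sketch, with illustrative archive and entry names:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHar {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative archive and entry names; har:/// resolves against the default FS.
    Path inHar = new Path("har:///mr-history/archived/logs-20150320.har/job_001.jhist");
    FileSystem harFs = inHar.getFileSystem(conf);
    for (FileStatus s : harFs.listStatus(inHar.getParent())) {
      System.out.println(s.getPath());   // entries list like a normal directory
    }
    try (FSDataInputStream in = harFs.open(inHar)) {
      System.out.println("first byte: " + in.read());  // reads like a plain HDFS file
    }
  }
}
{code}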

Clean archived logs
Archived logs are deleted by the History Server's cleaning task, which is 
triggered by a timer every 24h by default (configurable). If all of the log 
files in an archive package have expired (the expiry time is configurable), 
the package is deleted by this task immediately.
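
A sketch of that rule follows: a .har is dropped only when every entry 
inside it has passed the retention cutoff. The layout is assumed rather than 
actual History Server code, and the archive is assumed to sit on the default 
file system:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanArchivedLogs {
  /** Delete each .har under archivedDir whose entries have all passed maxAgeMs. */
  public static void clean(Configuration conf, Path archivedDir, long maxAgeMs)
      throws Exception {
    FileSystem fs = archivedDir.getFileSystem(conf);
    long cutoff = System.currentTimeMillis() - maxAgeMs;
    for (FileStatus har : fs.listStatus(archivedDir)) {
      if (!har.getPath().getName().endsWith(".har")) {
        continue;
      }
      // View the archive's contents through the har:// scheme.
      Path harView = new Path("har://" + har.getPath().toUri().getPath());
      FileSystem harFs = harView.getFileSystem(conf);
      boolean allExpired = true;
      for (FileStatus entry : harFs.listStatus(harView)) {
        if (entry.getModificationTime() >= cutoff) {
          allExpired = false;
          break;
        }
      }
      if (allExpired) {
        fs.delete(har.getPath(), true);  // remove the whole archive package
      }
    }
  }
}
{code}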
5.  Detailed Design
5.1 AM logs
5.1.1 Archive AM logs
 
1.  The archiving thread is periodically called by a Scheduler 
