I don't have an example of IIS log files at hand, but from what you described, 
it sounds like analyzing one line of log data depends on data from previous 
lines. You should be clearer about what this dependency is and what you are 
trying to do.
Based on your question alone, you still have several options; which one is 
better depends on your requirements and data.
1) You know the existing default TextInputFormat is not suitable for your 
case, so you need to find an alternative, or write your own.
2) If the dependencies never cross files, only lines, you can use a 
WholeFileInputFormat (no such class ships with Hadoop itself, but it is easy 
to implement yourself; see the sketch after this list).
3) If the dependencies cross files, then you may have to enforce your business 
logic on the reducer side instead of the mapper side. Without knowing the 
detailed requirements of this dependency, it is hard to give you more detail, 
but you need to find out what are good KEY candidates for your dependency 
logic, send the data to the reducers keyed on that, and enforce your logic on 
the reducer side. If one MR job is not enough to resolve the dependency, you 
may need to chain several MR jobs together.
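For option 2, here is a minimal sketch of what a WholeFileInputFormat could 
look like against the newer org.apache.hadoop.mapreduce API. The names 
WholeFileInputFormat and WholeFileRecordReader are just illustrative; nothing 
like this ships with Hadoop, so treat it as a starting point rather than a 
finished implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One record per file: key is NullWritable, value is the whole file content.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // never split, so one mapper sees the whole file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    return new WholeFileRecordReader();
  }
}

class WholeFileRecordReader
    extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit fileSplit;
  private Configuration conf;
  private BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (!processed) {
      // Read the entire file into one value in a single call.
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
    return false;
  }

  @Override
  public NullWritable getCurrentKey() { return NullWritable.get(); }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return processed ? 1.0f : 0.0f; }

  @Override
  public void close() { /* nothing to close */ }
}

Because isSplitable() returns false, each map task sees one complete file, so 
dependencies between lines of the same file never cross task boundaries. The 
trade-off is that one very large log file is handled by a single mapper, so 
you lose parallelism within that file.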
Yong

Date: Mon, 30 Dec 2013 15:58:57 +0800
Subject: any suggestions on IIS log storage and analysis?
From: [email protected]
To: [email protected]

Hi,
HDFS splits files into blocks, and MapReduce runs a map task for each block. 
However, the fields can change within IIS log files, which means fields in one 
block may depend on another block, making the files unsuitable for a MapReduce 
job as-is. It seems some preprocessing is needed before storing and analyzing 
the IIS log files. We plan to parse each line into the same set of fields and 
store them in Avro files with compression. Any other alternatives? HBase? Or 
any suggestions on analyzing IIS log files?
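For what it is worth, writing the parsed records out as compressed Avro could 
look roughly like the sketch below. The IisLogEntry schema and its field names 
are made up for illustration; a real schema would cover every W3C/IIS field 
you decide to keep:

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical example: parse IIS log lines into one fixed record schema
// and write them to a compressed Avro data file.
public class IisLogToAvro {

  // Illustrative schema with a few common W3C/IIS fields; optional fields
  // are unions with null so records from logs with fewer fields still fit.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"IisLogEntry\",\"fields\":["
      + "{\"name\":\"date\",\"type\":\"string\"},"
      + "{\"name\":\"time\",\"type\":\"string\"},"
      + "{\"name\":\"cs_uri_stem\",\"type\":[\"null\",\"string\"],\"default\":null},"
      + "{\"name\":\"sc_status\",\"type\":[\"null\",\"int\"],\"default\":null}"
      + "]}");

  public static void main(String[] args) throws IOException {
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(SCHEMA));
    writer.setCodec(CodecFactory.deflateCodec(6)); // block-level compression
    writer.create(SCHEMA, new File("iis-logs.avro"));

    // In a real job this record would come from parsing one log line.
    GenericRecord r = new GenericData.Record(SCHEMA);
    r.put("date", "2013-12-30");
    r.put("time", "15:58:57");
    r.put("cs_uri_stem", "/index.html");
    r.put("sc_status", 200);
    writer.append(r);

    writer.close();
  }
}

The deflate codec ships with Avro itself; Snappy is another common choice on 
Hadoop clusters if the snappy libraries are available.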

thanks!

