Thanks, Yong! The dependence never crosses files, but since HDFS splits files into blocks, it may cross blocks, which makes it difficult to write an MR job. I don't quite understand what you mean by "WholeFileInputFormat". Actually, I have no idea how to deal with dependence across blocks.
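
For reference, a minimal sketch of the preprocessing idea (parsing each line to the same fields), assuming the logs are in the W3C extended format, where a `#Fields:` directive names the columns for the data lines that follow it. The function name `normalize` and the `TARGET_FIELDS` list are hypothetical, and the sketch reads one whole file sequentially rather than running as an MR job:

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical target schema: every output record gets exactly these fields.
TARGET_FIELDS = ["date", "time", "c-ip", "cs-method", "cs-uri-stem", "sc-status"]


def normalize(lines: Iterable[str],
              target_fields: List[str] = TARGET_FIELDS) -> Iterator[Dict[str, str]]:
    """Yield one dict per data line, always with the same keys.

    The most recent "#Fields:" directive decides how a data line is split,
    so the lines must be read in order from the start of the file.
    """
    current_fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only "#Fields:" changes the column layout.
            if line.startswith("#Fields:"):
                current_fields = line[len("#Fields:"):].split()
            continue
        if not current_fields:
            # Data line seen before any "#Fields:" directive: cannot be parsed.
            continue
        row = dict(zip(current_fields, line.split()))
        # Fields absent from the current layout are filled with "-",
        # the placeholder the log format itself uses for empty values.
        yield {name: row.get(name, "-") for name in target_fields}
```

Because each output record carries all of its fields, the records no longer depend on earlier lines, so the normalized output (for example the Avro files) can be split freely.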

2013/12/31 java8964 <[email protected]>

> I don't know any example of IIS log files. But from what you described, it
> looks like analyzing one line of log data depends on some previous lines
> data. You should be more clear about what is this dependence and what you
> are trying to do.
>
> Just based on your questions, you still have different options, which one
> is better depends on your requirements and data.
>
> 1) You know the existing default TextInputFormat not suitable for your
> case, you just need to find alternatives, or write your own.
> 2) If the dependences never cross the files, just cross lines, you can use
> WholeFileInputFormat (No such class coming from Hadoop itself, but very
> easy implemented by yourself)
> 3) If the dependences cross the files, then you maybe have to enforce your
> business logics in reducer side, instead of mapper side. Without knowing
> your detail requirements of this dependence, it is hard to give you more
> detail, but you need to find out what are good KEY candidates for your
> dependence logic, send the data based on that to the reducers, and enforce
> your logic on the reducer sides. If one MR job is NOT enough to solve your
> dependence, you may need chain several MR jobs together.
>
> Yong
>
> ------------------------------
> Date: Mon, 30 Dec 2013 15:58:57 +0800
> Subject: any suggestions on IIS log storage and analysis?
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> HDFS splits files into blocks, and mapreduce runs a map task for each
> block. However, Fields could be changed in IIS log files, which means
> fields in one block may depend on another, and thus make it not suitable
> for mapreduce job. It seems there should be some preprocess before storing
> and analyzing the IIS log files. We plan to parse each line to the same
> fields and store in Avro files with compression. Any other alternatives?
> Hbase? or any suggestions on analyzing IIS log files?
>
> thanks!
