Thanks, Peyman. The problem is that the dependence is not simply a key; it is so entangled that without the "#Fields" line in one block, we can't even parse a single line in another block.
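To make it concrete: IIS writes W3C extended logs, where a "#Fields:" directive declares the columns and every entry after it is only interpretable against the most recent directive. A minimal sketch of that constraint (the class name and parsing details below are illustrative, not our actual code):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: shows why a split without its directive is unparseable.
    public class W3CLineParser {
        // Set by the most recent "#Fields" directive seen in this stream.
        private String[] fieldNames;

        // Returns a field-name -> value map, or null if nothing can be emitted.
        public Map<String, String> parse(String line) {
            if (line.startsWith("#Fields:")) {
                fieldNames = line.substring("#Fields:".length()).trim().split(" ");
                return null; // directive line, no log entry to emit
            }
            if (line.startsWith("#") || fieldNames == null) {
                return null; // other directive, or a split with no directive yet
            }
            String[] values = line.split(" ");
            Map<String, String> entry = new HashMap<>();
            for (int i = 0; i < fieldNames.length && i < values.length; i++) {
                entry.put(fieldNames[i], values[i]);
            }
            return entry;
        }
    }

A map task whose split begins after the directive sees fieldNames == null for every line, so with a plain TextInputFormat the block simply cannot be parsed on its own.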
2014/1/1 Peyman Mohajerian <[email protected]>

> You can run a series of map-reduce jobs on your data. If some log line is
> related to another line, e.g. based on sessionId, you can emit the
> sessionId as the key of your mapper output, with the value being the rows
> associated with that sessionId, so on the reducer side data from different
> blocks will come together. Of course that is just one example, but the
> fact that your file content is being split doesn't prevent your analysis
> even if you have inter-dependencies.
>
>
> On Mon, Dec 30, 2013 at 7:31 PM, Fengyun RAO <[email protected]> wrote:
>
>> Thanks, I understand now, but I don't think this is what we need. The IIS
>> log files are very big (e.g., several GB per file), so we need to split
>> them for parallel processing. However, this could be used as some sort of
>> preprocessing, to transform the original log files into splitable files
>> such as Avro files.
>>
>>
>> 2013/12/31 java8964 <[email protected]>
>>
>>> Google "Hadoop WholeFileInputFormat" or search for it in the book
>>> "Hadoop: The Definitive Guide
>>> <http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA242&lpg=PA242&dq=hadoop+definitive+guide+WholeFileInputFormat&source=bl&ots=i7BUTBU8Vw&sig=0m5effHuOY1kuqiRofqTbeEl7KU&hl=en&sa=X&ei=yijCUs_YLqHJsQSZ1oD4DQ&ved=0CD0Q6AEwAA>"
>>>
>>> Yong
>>>
>>> ------------------------------
>>> Date: Tue, 31 Dec 2013 09:39:58 +0800
>>> Subject: Re: any suggestions on IIS log storage and analysis?
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Thanks, Yong!
>>>
>>> The dependence never crosses files, but since HDFS splits files into
>>> blocks, it may cross blocks, which makes it difficult to write an MR
>>> job. I don't quite understand what you mean by "WholeFileInputFormat".
>>> Actually, I have no idea how to deal with dependence across blocks.
>>>
>>> 2013/12/31 java8964 <[email protected]>
>>>
>>> I don't know any example of IIS log files, but from what you described,
>>> it looks like analyzing one line of log data depends on some previous
>>> lines' data. You should be more clear about what this dependence is and
>>> what you are trying to do.
>>>
>>> Just based on your questions, you still have different options; which
>>> one is better depends on your requirements and data.
>>>
>>> 1) You know the existing default TextInputFormat is not suitable for
>>> your case; you just need to find an alternative, or write your own.
>>> 2) If the dependences never cross files, just lines, you can use
>>> WholeFileInputFormat (no such class comes with Hadoop itself, but it is
>>> very easy to implement yourself).
>>> 3) If the dependences cross files, then you may have to enforce your
>>> business logic on the reducer side instead of the mapper side. Without
>>> knowing the detailed requirements of this dependence it is hard to give
>>> you more detail, but you need to find out what the good KEY candidates
>>> for your dependence logic are, send the data to the reducers based on
>>> that, and enforce your logic on the reducer side. If one MR job is NOT
>>> enough to resolve your dependence, you may need to chain several MR
>>> jobs together.
>>>
>>> Yong
>>>
>>> ------------------------------
>>> Date: Mon, 30 Dec 2013 15:58:57 +0800
>>> Subject: any suggestions on IIS log storage and analysis?
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Hi,
>>>
>>> HDFS splits files into blocks, and MapReduce runs a map task for each
>>> block. However, the fields can change within an IIS log file, which
>>> means fields in one block may depend on another, making the raw files
>>> unsuitable for a MapReduce job. It seems there should be some
>>> preprocessing before storing and analyzing the IIS log files. We plan
>>> to parse each line into the same fields and store them in Avro files
>>> with compression. Any other alternatives? HBase? Or any suggestions on
>>> analyzing IIS log files?
>>>
>>> Thanks!
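P.S. For anyone who finds this thread later: below is roughly what the WholeFileInputFormat Yong mentioned looks like. This is a minimal, untested sketch adapted from the "Hadoop: The Definitive Guide" example; it hands each file to a single mapper as one record, so it is only practical as a preprocessing step (e.g., re-emitting normalized records to Avro) and only for files that fit in a mapper's memory.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class WholeFileInputFormat
            extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // the whole point: never split a file across mappers
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new WholeFileRecordReader();
        }
    }

    class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private TaskAttemptContext context;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file into memory as a single record.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }

Because the whole file passes through one mapper, the current #Fields state can be tracked from top to bottom while rewriting the log into a splitable format.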
