Hi, HDFS splits files into blocks, and by default MapReduce runs one map task per block. However, the field list in an IIS log file can change partway through the file, so the lines in one block may depend on a Fields header that sits in another block. That makes the raw files unsuitable as direct input to a MapReduce job. It seems some preprocessing is needed before storing and analyzing the IIS log files. We plan to parse each line into the same fixed set of fields and store the result in compressed Avro files. Are there better alternatives? Would HBase be a good fit? Any other suggestions on analyzing IIS log files would be welcome.
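To make the preprocessing idea concrete, here is a rough sketch of the normalization step, assuming the logs are in W3C extended format, where a "#Fields:" directive lists the columns for the lines that follow it. The function name normalize_iis_log and the TARGET_FIELDS list are hypothetical and only for illustration; writing the rows out to Avro is not shown.

```python
# Hypothetical target schema; the real field list would depend on our logs.
TARGET_FIELDS = ["date", "time", "c-ip", "cs-method", "cs-uri-stem", "sc-status"]


def normalize_iis_log(lines, target_fields=TARGET_FIELDS):
    """Yield one dict per log entry, always keyed by the same target fields."""
    current_fields = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Only the #Fields directive matters here; other directives are skipped.
            if line.startswith("#Fields:"):
                current_fields = line[len("#Fields:"):].split()
            continue
        if current_fields is None:
            # A data line seen before any #Fields header cannot be interpreted.
            continue
        values = line.split()
        if len(values) != len(current_fields):
            # Malformed line; this sketch simply drops it.
            continue
        row = dict(zip(current_fields, values))
        # Fields absent from the current header are filled with "-",
        # the placeholder W3C logs use for empty values.
        yield {name: row.get(name, "-") for name in target_fields}
```

Because each output row carries all of its fields, the rows no longer depend on a header elsewhere in the file and can be split across map tasks freely.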
thanks!
