A few things:

Storing simple, single-line text records in sequence files isn't optimal, as you're just adding overhead for every line of text stored as a Text value in it. If you have typed data and can benefit from type-based serialization for each record, go for a container format like SequenceFiles (with whatever serialization technique) or Avro DataFiles (which have embedded schema support, among other niceties).
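To put a rough number on that per-line overhead, here is a minimal back-of-the-envelope sketch. It assumes the uncompressed SequenceFile record layout (a 4-byte record length, a 4-byte key length, then key and value), a LongWritable key of 8 bytes (an assumption; the key type used in the test wasn't stated), and a Text value stored as a variable-length integer prefix plus the UTF-8 bytes. File headers and sync markers are ignored, and the block-compressed layout used in the test packs records differently, so treat this only as an illustration of the framing cost, not as a measurement. The function names are made up for this example.

```python
def vint_size(n: int) -> int:
    """Bytes taken by Hadoop's variable-length integer encoding for n >= 0.

    Values up to 127 fit in one byte; larger values use one marker byte
    followed by the minimal number of bytes needed for the value.
    """
    if n <= 127:
        return 1
    return 1 + (n.bit_length() + 7) // 8


def seqfile_record_size(line: str, key_size: int = 8) -> int:
    """Estimated bytes for one line stored as an uncompressed record.

    key_size=8 assumes a LongWritable key (hypothetical choice).
    """
    payload = len(line.encode("utf-8"))
    framing = 4 + 4  # record length field + key length field
    return framing + key_size + vint_size(payload) + payload


def plain_text_size(line: str) -> int:
    """Bytes for the same line in a plain text file (line + newline)."""
    return len(line.encode("utf-8")) + 1


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    seq = seqfile_record_size(sample)
    txt = plain_text_size(sample)
    print(f"plain text: {txt} bytes, sequence file record: {seq} bytes")
    print(f"extra bytes per line: {seq - txt}")
```

The shorter the lines, the larger the share of each record that is framing rather than data, which is why a container format pays off mainly when the records themselves are typed or structured.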
When comparing the result with Lzo, also factor in the indexing time, as that's part of the requirement in making it parallel (I think the newer libs auto-index, but that's just what I heard was the plan; I don't know if it's already available).

On Fri, Sep 7, 2012 at 4:55 AM, Young-Geun Park <[email protected]> wrote:
> Hi, All
>
> I have tested which method is better between Lzo and SequenceFile for a BIG
> file.
>
> File size is 10GiB and WordCount MR is used.
> Inputs of WordCount MR are lzo which would be indexed by LzoIndexTool(lzo),
> sequence file which is compressed by block level snappy(seq) , and
> uncompressed original file(none).
>
> Map output is compressed except of uncompressed file. mapreduce output is
> not compressed for all cases.
>
> The following are wordcount MR running time;
> none    lzo     seq
> 248s    243s    1410s
>
> -Test Environments
>
> OS : CentOS 5.6 (x64) (kernel = 2.6.18)
> # of Core : 8 (cpu = Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
> RAM : 18GB
> Java version : 1.6.0_26
> Hadoop version : CDH3U2
> # of datanode(tasktracker) : 8
>
> According to the result, The running time of SequnceFile is much less than
> the others.
> Before testing, I had expected that the results of both SequenceFile and
> Lzo are about the same.
>
> I want to know why performance of the sequence file compressed by snappy is
> so bad?
>
> do I miss anything in tests?
>
>
> Regards,
> Park
>
>

--
Harsh J
