Hi, I have a question about sequence files: why should I use one, and under 
what kind of circumstances?
Let's say I have a CSV file; I can store that directly in HDFS. But if I 
know that the first 2 fields form some kind of key, and most of my MR jobs will 
query on that key, does it make sense to store the data as a sequence file in 
this case? And what benefits would that bring?
The main benefit I want is to reduce the IO of the MR jobs, but I am not sure 
a sequence file can give me that. If the data is stored as key/value pairs in 
the sequence file, and the mapper/reducer will mostly use only the key part to 
compare/sort, what difference does it make if I just store the data as a flat 
file and use the first 2 fields as the key?
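To make the flat-file alternative concrete, here is a rough sketch of what I 
have in mind. It is plain Python rather than actual Hadoop code, and the 
function names and sample rows are made up for illustration. Every line is 
still read in full; the key is just the first 2 fields.

```python
import csv
import io


def split_key_value(line, key_fields=2):
    """Split one CSV line into (key, value); the key is the first key_fields columns."""
    row = next(csv.reader(io.StringIO(line)))
    key = tuple(row[:key_fields])
    value = row[key_fields:]
    return key, value


def map_lines(lines):
    """Mimic a mapper over a flat file: emit (key, value) for each non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # The whole line is read and parsed even if only the key is used later.
            yield split_key_value(line)


if __name__ == "__main__":
    # Hypothetical sample rows, not real data.
    sample = ["us,2013,foo,1.5\n", "uk,2013,bar,2.0\n"]
    for key, value in map_lines(sample):
        print(key, value)
```

In this version the value part is always read and parsed along with the key, 
which is exactly the cost I am wondering whether a sequence file could avoid.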
A mapper reading a sequence file will scan the whole content of the file 
anyway. If only the key part is compared, do we save IO by NOT deserializing 
the value part, assuming some optimization is done here? It sounds like we 
could avoid deserializing the value part when it is unnecessary. Is that the 
benefit? If not, why would I use a key/value format instead of just 
(Text, Text)? Assume that my data doesn't contain any binary data.
Thanks

                                          
