When a segment is written to HDFS, it is split across several directories, one of which is "content". As far as I understand, "content" contains sequence files where the key is the URL and the value is that URL's raw content. Is that correct? If I want to read the entire raw content of a crawled URL, can I run a MapReduce job over the segments created by the crawl and read only the "content" directory?
Thanks, Amit.

