When a segment is written to HDFS, it is written into several directories,
one of which is "content".
As far as I understand, "content" contains sequence files where the key is
the URL and the value is the URL's raw content. Is that so?
If I want to read the entire raw content of a crawled URL, can I run a
MapReduce job over the segments created by the crawl and read only the
"content" directory?

Thanks,

Amit.
