Page content in segment

Amit Sela Mon, 30 Sep 2013 08:58:41 -0700

When a segment is written to HDFS, it is written into several directories,
one of which is "content".
As far as I understand, "content" contains sequence files where the key is
the URL and the value is the URL's raw content. Is that so?
If I want to read the entire raw content of a crawled URL, can I run a
MapReduce job over the segments created by the crawl and read only the
"content" directory?


Thanks,

Amit.
