Yes. Check the Content class for its format: the key is a Text that
contains the URL, and the value is a Content object.
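
In case it helps, here is a minimal sketch of reading one of those files
directly, without MapReduce. It assumes a Nutch 1.x segment on a Hadoop 1.x
style API, where each part under "content" is a MapFile directory whose "data"
file is an ordinary sequence file. The class name DumpSegmentContent and the
path you pass in are only illustrative, so adjust them to your setup.

```java
import java.io.IOException;
import java.io.PrintStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical helper class, not part of Nutch.
public class DumpSegmentContent {

  // Prints one line per record: URL, content type, size of the raw content.
  public static void dump(Configuration conf, Path data, PrintStream out)
      throws IOException {
    FileSystem fs = data.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      Text url = new Text();            // key: the URL
      Content content = new Content();  // value: fetched content plus metadata
      while (reader.next(url, content)) {
        byte[] raw = content.getContent(); // the raw bytes as fetched
        out.println(url + "\t" + content.getContentType()
            + "\t" + raw.length + " bytes");
      }
    } finally {
      reader.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // Assumed argument: <segment>/content/part-00000/data
    Configuration conf = NutchConfiguration.create();
    dump(conf, new Path(args[0]), System.out);
  }
}
```

For a MapReduce job the same key and value types should apply: as far as I
know, pointing SequenceFileInputFormat at <segment>/content gives your mapper
Text keys and Content values, which is how Nutch's own parse job reads it.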
-----Original message-----
> From:Amit Sela <[email protected]>
> Sent: Monday 30th September 2013 17:58
> To: [email protected]
> Subject: Page content in segment
>
> When a segment is written to HDFS, it's written into different directories,
> one of these is "content".
> As far as I understand, content contains sequence files where the key is
> URL and the value is the URL's raw content. Is that so ?
> If I want to read the entire raw content of a URL crawled, can I MapReduce
> over the segments created by the crawl and read only the "content"
> directory ?
>
> Thanks,
>
> Amit.
>