Yes. Check the Content class for its format; the key is of class Text and
contains the URL.
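
For a quick look outside MapReduce, something along these lines should work.
It is only a rough sketch, not tested against your setup: it assumes Nutch 1.x
with the Hadoop and Nutch jars on the classpath, and that each part of the
"content" directory is a MapFile directory whose records sit in a file named
"data" (for example <segment>/content/part-00000/data). The class name
DumpSegmentContent and the command-line argument are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical helper: prints URL, content type and raw size for every record.
public class DumpSegmentContent {
    public static void main(String[] args) throws Exception {
        // args[0] is assumed to point at one "data" file,
        // e.g. <segment>/content/part-00000/data
        Configuration conf = NutchConfiguration.create();
        Path data = new Path(args[0]);
        FileSystem fs = data.getFileSystem(conf);

        // Older constructor form; deprecated in newer Hadoop releases.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        try {
            Text url = new Text();            // key: the URL
            Content content = new Content();  // value: fetched content + metadata
            while (reader.next(url, content)) {
                byte[] raw = content.getContent();  // the raw bytes as fetched
                System.out.println(url + "\t" + content.getContentType()
                        + "\t" + raw.length + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}
```

In a MapReduce job the same types apply: with SequenceFileInputFormat pointed
at the "content" directory, the mapper should receive Text keys and Content
values, and content.getContent() gives the raw bytes.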
 
-----Original message-----
> From:Amit Sela <[email protected]>
> Sent: Monday 30th September 2013 17:58
> To: [email protected]
> Subject: Page content in segment
> 
> When a segment is written to HDFS, it's written into different directories,
> one of these is "content".
> As far as I understand, "content" contains sequence files where the key is
> the URL and the value is the URL's raw content. Is that so?
> If I want to read the entire raw content of a crawled URL, can I MapReduce
> over the segments created by the crawl and read only the "content"
> directory?
> 
> Thanks,
> 
> Amit.
> 
