Hi Mansour

On Fri, Apr 6, 2012 at 2:04 PM, Mansour Al Akeel <[email protected]> wrote:

> Using "nutch readseg -dump", on each segment doesn't give me the
> (x)html page as is. This is an issue.
>

You mean that you want the raw xhtml? To begin with, you can try the
parsechecker tool to check that you are actually fetching the correct
content. You may also need to use some of the SegmentReader's general
options to filter out the segment directories that do not contain the raw
xhtml. Can you please try this?

http://wiki.apache.org/nutch/bin/nutch_readseg

> I don't need to generate indexes for solr as I am not going to search
> those pages.
>

I understand.
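
In case it helps, here is a rough sketch of what I mean by the general options. The segment path and output directory below are made-up placeholders, and I am assuming you run this from the Nutch install directory; the -no* switches are the ones listed in the readseg usage, but please check them against your version.

```shell
# Hypothetical paths: substitute one of your own segments and any output dir.
SEGMENT="crawl/segments/20120406140400"
OUTDIR="segment_dump"

# Switch off every part of the segment except the content directory,
# which is where the fetched page bytes are stored.
READSEG_OPTS="-nofetch -nogenerate -noparse -noparsedata -noparsetext"

if [ -x bin/nutch ]; then
  # $READSEG_OPTS is left unquoted on purpose so each switch is its own argument.
  bin/nutch readseg -dump "$SEGMENT" "$OUTDIR" $READSEG_OPTS
else
  echo "bin/nutch not found; run this from the Nutch install directory" >&2
fi
```

As far as I know the dump is still a text file of per-URL records, so the page source will sit inside each record rather than in one file per page.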

