Re: Crawl and extract data

Lewis John Mcgibbney Sat, 07 Apr 2012 03:29:20 -0700

Hi Mansour

On Fri, Apr 6, 2012 at 2:04 PM, Mansour Al Akeel
<[email protected]> wrote:


> Using "nutch readseg -dump" on each segment doesn't give me the
> (x)html page as is. This is an issue.
>

Do you mean that you want the raw XHTML? To begin with, you can try the
parsechecker tool to check that you are actually fetching the correct
information. You may also need to use some of SegmentReader's general
options to filter out segment directories, since only some of them may
contain the raw XHTML. Can you please try this?

http://wiki.apache.org/nutch/bin/nutch_readseg
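
As a rough sketch of what I mean (not tested against your crawl): the
script below prints a parsechecker call and a readseg call that switches
off every part of the segment except the fetched content. The segment
path, output directory and URL are placeholders I made up, so substitute
your own. Set RUN=1 to execute the commands instead of printing them.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from your setup).
NUTCH="${NUTCH:-bin/nutch}"
SEGMENT="${SEGMENT:-crawl/segments/SEGMENT_NAME}"
OUT="${OUT:-segment_dump}"
URL="${URL:-http://example.com/}"

# General options that exclude everything except the content directory.
FILTERS="-nofetch -nogenerate -noparse -noparsedata -noparsetext"

# Print the parsechecker command used to verify fetching and parsing of one URL.
parsechecker_cmd() {
    printf '%s parsechecker -dumpText %s\n' "$NUTCH" "$URL"
}

# Print the readseg command that dumps only the fetched content.
readseg_cmd() {
    printf '%s readseg -dump %s %s %s\n' "$NUTCH" "$SEGMENT" "$OUT" "$FILTERS"
}

if [ "${RUN:-0}" = "1" ]; then
    # FILTERS is left unquoted on purpose so it splits into separate flags.
    "$NUTCH" parsechecker -dumpText "$URL"
    "$NUTCH" readseg -dump "$SEGMENT" "$OUT" $FILTERS
else
    parsechecker_cmd
    readseg_cmd
fi
```

If this works as I expect, the output directory should hold a plain-text
dump with one record per URL, including the content as it was fetched.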


> I don't need to generate indexes for solr as I am not going to search
> those pages.
>

I understand.
