Hi,

Regarding the fetch process: it is not necessarily the last place that holds
the entire DOM representation of a page (if that is what you mean by "full
page"). In fact, the DOM is only built during fetch when parsing during fetch
is set to true; otherwise it is not loaded at all. A separate (re)parser job
can load the DOM too.
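
As a rough sketch, the "parsing during fetch" switch mentioned above is, as
far as I know, the fetcher.parse property, which can be overridden in
conf/nutch-site.xml. The value below is only an example; check
nutch-default.xml in your version for the actual default and description.

```xml
<!-- conf/nutch-site.xml: illustrative override, not a recommended setting -->
<property>
  <name>fetcher.parse</name>
  <!-- true: parse (and build the DOM) while fetching;
       false: leave parsing to a separate parse job -->
  <value>true</value>
</property>
```

With the value left at false, the page is only parsed when the separate
parser job runs.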

Ferdy

On Fri, Aug 3, 2012 at 1:43 PM, X3C TECH <[email protected]> wrote:

> Hello,
> I wanted to know at what point does Nutch stop keeping the HTML page? My
> issue is I need to be able to extract certain info from a page, for
> example:
> <username>
> <description>
> <photo>
> <profile link>
> there may be multiple profiles on each page, and my understanding is
> currently Nutch has an issue with multiple page fields with the same name.
> My thinking was based on https://issues.apache.org/jira/browse/NUTCH-978.
> I was thinking of intercepting an HTML page and converting it to XML before
> parsing. I'm assuming that this would fit between fetch and parse. I have
> a few questions, though:
> 1. Am I correct in thinking that Fetch is the last process that keeps a
> full HTML page with all tags, etc intact?
> 2. Does Nutch parse XML (I didn't see an explicit plugin for that)? And if
> so, are there any issues known for multiple fields with the same name in
> the XML tree? I see that Tika has one, but it seems to parse just like an
> HTML page
> 3. Does the Plugin Tutorial still apply to Nutch 2.0 or is it only for
> previous versions?
>
> Thanks for your help
>
> Iggy
>
