Ferdy,
Great!! Thanks for your reply!!

On Fri, Aug 3, 2012 at 8:55 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> About the fetch process: this is not necessarily the last place that holds
> the entire DOM representation of a page (if that is what you mean by "full
> page"). The DOM is only built there when parsing during fetch is set to
> true; otherwise it is not loaded at all. A separate (re)parser job is able
> to load the DOM too.
>
> Ferdy
>
> On Fri, Aug 3, 2012 at 1:43 PM, X3C TECH <[email protected]> wrote:
>
> > Hello,
> > I wanted to know at what point does Nutch stop keeping the HTML page? My
> > issue is I need to be able to extract certain info from a page, for
> > example:
> > <username>
> > <description>
> > <photo>
> > <profile link>
> > There may be multiple profiles on each page, and my understanding is
> > that Nutch currently has an issue with multiple page fields with the
> > same name. My thinking was based on
> > https://issues.apache.org/jira/browse/NUTCH-978.
> > I was thinking of intercepting an HTML page and converting it to XML
> > before parsing. I'm assuming that this would fit between fetch and
> > parse. I have a few questions, though:
> > 1. Am I correct in thinking that Fetch is the last process that keeps a
> > full HTML page with all tags, etc intact?
> > 2. Does Nutch parse XML (I didn't see an explicit plugin for that)? And if
> > so, are there any known issues with multiple fields with the same name in
> > the XML tree? I see that Tika has one, but it seems to parse just like an
> > HTML page.
> > 3. Does the Plugin Tutorial still apply to Nutch 2.0 or is it only for
> > previous versions?
> >
> > Thanks for your help
> >
> > Iggy
> >
>
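
For readers following the thread, the "multiple profiles with the same field names" problem from the original question can be sketched outside Nutch. This is only an illustration, not a Nutch plugin and not the Nutch or Tika API: the XML layout, the `SAMPLE_XML` data, and the `extract_profiles` helper are all hypothetical, and Python is used purely for brevity. The idea is to keep each profile as its own record so that repeated field names do not overwrite one another.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, as it might look after converting a listing page
# to XML. The element names are assumptions made for this sketch.
SAMPLE_XML = """
<profiles>
  <profile>
    <username>alice</username>
    <description>First profile</description>
    <photo>/img/alice.png</photo>
    <link>/users/alice</link>
  </profile>
  <profile>
    <username>bob</username>
    <description>Second profile</description>
    <photo>/img/bob.png</photo>
    <link>/users/bob</link>
  </profile>
</profiles>
"""


def extract_profiles(xml_text):
    """Return one dict per <profile> element, keyed by child tag name."""
    root = ET.fromstring(xml_text.strip())
    profiles = []
    for node in root.iter("profile"):
        # Each profile gets its own dict, so repeated field names
        # (username, photo, ...) stay grouped per profile.
        record = {child.tag: (child.text or "").strip() for child in node}
        profiles.append(record)
    return profiles


if __name__ == "__main__":
    for profile in extract_profiles(SAMPLE_XML):
        print(profile)
```

Grouping by parent element rather than flattening all fields into one page-level record is the design choice here; how such records would then be mapped into Nutch's storage or index is not covered by this sketch.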
