Hi,

About the fetch process: it is not necessarily the last place that holds the entire DOM representation of a page (if that is what you mean by "full page"). In fact, the fetcher only loads the DOM when parsing during fetch is set to true; otherwise the DOM is not loaded at all. A separate (re)parser job is able to load the DOM too.
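
To illustrate the "parsing during fetch" setting: as far as I know it is controlled by the fetcher.parse property, which you would override in conf/nutch-site.xml. The fragment below is only a sketch and assumes that property name applies to your Nutch version, so please check it against your nutch-default.xml.

```xml
<!-- conf/nutch-site.xml (sketch; verify the property name in your nutch-default.xml) -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <!-- true: parse (and so build the DOM) during fetch; false: leave parsing to a separate parse job -->
    <value>true</value>
  </property>
</configuration>
```

With the value set to false, the fetcher does not build the DOM, and it is only loaded once the separate parser job runs.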

Ferdy

On Fri, Aug 3, 2012 at 1:43 PM, X3C TECH <[email protected]> wrote:

> Hello,
> I wanted to know at what point does Nutch stop keeping the HTML page? My
> issue is I need to be able to extract certain info from a page, for
> example:
> <username>
> <description>
> <photo>
> <profile link>
> there may be multiple profiles on each page, and my understanding is
> currently Nutch has an issue with multiple page fields with the same name.
> My thinking was based on https://issues.apache.org/jira/browse/NUTCH-978.
> I was thinking of intercepting an HTML page and converting it to XML before
> parsing. I'm assuming that this would fit between fetch and parse. Few
> questions I have though:
> 1. Am I correct in thinking that Fetch is the last process that keeps a
> full HTML page with all tags, etc intact?
> 2. Does Nutch parse XML (I didn't see an explicit plugin for that)? And if
> so, are there any issues known for multiple fields with the same name in
> the XML tree? I see that Tika has one, but it seems to parse just like an
> HTML page
> 3. Does the Plugin Tutorial still apply to Nutch 2.0 or is it only for
> previous versions?
>
> Thanks for your help
>
> Iggy
>

