Ferdy, Great!! Thanks for your reply!!
On Fri, Aug 3, 2012 at 8:55 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> About the fetch process, this is not necessarily the last place that holds
> the entire DOM representation of a page. (If this is what you mean with
> full page). Actually it is only done when parsing during fetch is set to
> true, otherwise it is not loaded at all. A separate (re)parser job is able
> to load the DOM too.
>
> Ferdy
>
> On Fri, Aug 3, 2012 at 1:43 PM, X3C TECH <[email protected]> wrote:
>
> > Hello,
> > I wanted to know at what point does Nutch stop keeping the HTML page? My
> > issue is I need to be able to extract certain info from a page, for
> > example:
> > <username>
> > <description>
> > <photo>
> > <profile link>
> > there may be multiple profiles on each page, and my understanding is
> > currently Nutch has an issue with multiple page fields with the same name.
> > My thinking was based on https://issues.apache.org/jira/browse/NUTCH-978.
> > I was thinking of intercepting an HTML page and converting it to XML before
> > parsing. I'm assuming that this would fit between fetch and parse. Few
> > questions I have though:
> > 1. Am I correct in thinking that Fetch is the last process that keeps a
> > full HTML page with all tags, etc intact?
> > 2. Does Nutch parse XML (I didn't see an explicit plugin for that)? And if
> > so, are there any issues known for multiple fields with the same name in
> > the XML tree? I see that Tika has one, but it seems to parse just like an
> > HTML page
> > 3. Does the Plugin Tutorial still apply to Nutch 2.0 or is it only for
> > previous versions?
> >
> > Thanks for your help
> >
> > Iggy

