Re: parse-html plugin

Markus Jelsma Tue, 01 Feb 2011 09:42:57 -0800

Oh, I forgot. You could implement org.apache.nutch.parse.HtmlParseFilter. Then you 
can retrieve whatever you need and store it in the ParseResult object.
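
To make the idea concrete, here is a rough sketch that uses only the JDK's DOM
classes, so it runs without Nutch on the classpath. It is not Nutch code. It
only shows that the indexable text can be taken from one subtree while links
are still collected from the whole, unmodified root. The class and method names
(SubtreeTextDemo, findById, textOf, collectLinks) are made up for illustration,
and I am assuming col_centre is the id of an element on your page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SubtreeTextDemo {

    // Tiny well-formed sample page; the id "col_centre" is an assumption.
    static final String SAMPLE =
        "<html><body>"
      + "<div id=\"menu\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>"
      + "<div id=\"col_centre\">Main text <a href=\"/c\">C</a></div>"
      + "</body></html>";

    // Parse a well-formed document; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Depth-first search for the first element whose id attribute matches.
    static Element findById(Node node, String id) {
        if (node instanceof Element && id.equals(((Element) node).getAttribute("id"))) {
            return (Element) node;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Element hit = findById(children.item(i), id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Text of one subtree only, with runs of whitespace collapsed.
    static String textOf(Element subtree) {
        return subtree.getTextContent().replaceAll("\\s+", " ").trim();
    }

    // href values of all <a> elements below the given node, in document order.
    static List<String> collectLinks(Node root) {
        List<String> links = new ArrayList<>();
        walk(root, links);
        return links;
    }

    private static void walk(Node node, List<String> links) {
        if (node instanceof Element && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                links.add(href);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), links);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element centre = findById(doc, "col_centre");
        System.out.println("text:  " + textOf(centre));      // text from the subtree only
        System.out.println("links: " + collectLinks(doc));   // links from the full root
    }
}
```

The point of the sketch is that the text and the links come from two separate
walks. As long as the walk that collects links is given the full root, and
nothing has pruned or replaced nodes in that tree beforehand, narrowing the
text to one element should not reduce the number of links found.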


On Tuesday 01 February 2011 15:25:20 a a wrote:
> hi,
> 
> Is my question so difficult?
> Does no one have an idea?
> 
> thx
> 
> 
> mehdi
> 
> > From: [email protected]
> > To: [email protected]
> > Subject: RE: parse-html plugin
> > Date: Mon, 31 Jan 2011 16:05:22 +0000
> > 
> > 
> > Hi All,
> > 
> > Any ideas?
> > 
> > 
> > 
> > mehdi
> > 
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: parse-html plugin
> > > Date: Thu, 27 Jan 2011 18:58:36 +0000
> > > 
> > > 
> > > hi,
> > > In the class HtmlParser I changed the 'text' variable to index only a
> > > part of my HTML page, and since then I have lost a lot of outlinks!
> > > 
> > > ...
> > > 
> > >   // added on 26-01-2011 to extract only text inside <col_centre>
> > >   utils.getText(sb, extractIndexableContent(root));
> > >  
> > >   // utils.getText(sb, root);   // extract text (disabled on 26-01-2011)
> > >   
> > >       text = sb.toString();
> > > 
> > > ...
> > > 
> > > I believed that outlinks are not obtained from the text variable. In
> > > the same class we can see how outlinks are extracted:
> > > 
> > > 
> > > ArrayList<Outlink> l = new ArrayList<Outlink>();   // extract outlinks
> > > 
> > >       URL baseTag = utils.getBase(root);
> > >       if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> > >       utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> > >       outlinks = l.toArray(new Outlink[l.size()]);
> > > 
> > > Can you please tell me what I did wrong?
> > > 
> > > 
> > > mehdi

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
