Thanks for your reply :) So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going to overwrite the ParseResult variable of the original parse-html plugin?
Won't it also take more time, since the HTML source of each URL would be extracted and parsed twice (first by the original parse-html plugin, then by my new plugin)?

Thanks a lot,
mehdi

> From: [email protected]
> To: [email protected]
> Subject: Re: parse-html plugin
> Date: Tue, 1 Feb 2011 18:42:51 +0100
> CC: [email protected]
>
> Oh, I forgot. You could extend org.apache.nutch.parse.HtmlParsefilter. Then you
> can retrieve whatever you need and store it in the ParseResult object.
>
> On Tuesday 01 February 2011 15:25:20 a a wrote:
> > hi,
> >
> > is my question so difficult?
> > does no one have an idea?
> >
> > thx
> >
> > mehdi
> >
> > > From: [email protected]
> > > To: [email protected]
> > > Subject: RE: parse-html plugin
> > > Date: Mon, 31 Jan 2011 16:05:22 +0000
> > >
> > > Hi All,
> > >
> > > any idea?
> > >
> > > mehdi
> > >
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Subject: parse-html plugin
> > > > Date: Thu, 27 Jan 2011 18:58:36 +0000
> > > >
> > > > hi,
> > > > In the class HtmlParser I changed the 'text' variable to index only a
> > > > part of my html page, and since then I have lost a lot of outlinks!
> > > >
> > > > ...
> > > >
> > > > utils.getText(sb, extractIndexableContent(root)); // added on 26-01-2011 to extract only text inside <col_centre>
> > > >
> > > > // utils.getText(sb, root); // extract text --- disabled on 26-01-2011
> > > >
> > > > text = sb.toString();
> > > >
> > > > ...
> > > >
> > > > I believed that outlinks are not obtained from the text variable?! In
> > > > the same class we can see how outlinks are extracted:
> > > >
> > > > ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
> > > >
> > > > URL baseTag = utils.getBase(root);
> > > > if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> > > > utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> > > > outlinks = l.toArray(new Outlink[l.size()]);
> > > >
> > > > Can you please tell me what I did wrong?
> > > >
> > > > mehdi
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
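To make the expectation in the original question concrete, here is a minimal, self-contained sketch of the idea that text can be taken from one subtree while links are still collected from the full document root. It is only an illustration under stated assumptions: it uses the JDK's own DOM API instead of Nutch's utility classes, and the class name SubtreeTextSketch, the helpers parse, findSubtree and collectLinks, and the example.com URLs are all hypothetical. Only the <col_centre> tag comes from the thread. It also assumes the subtree lookup does not modify the tree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-in, not Nutch code: shows text and links taken from different nodes.
public class SubtreeTextSketch {

    /** Parses a well-formed XHTML string into a DOM document. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns the first element with the given tag name, or the root if there is none. */
    static Node findSubtree(Document doc, String tag) {
        NodeList hits = doc.getElementsByTagName(tag);
        return hits.getLength() > 0 ? hits.item(0) : doc.getDocumentElement();
    }

    /** Collects the href of every <a> element at or below the given node, in document order. */
    static List<String> collectLinks(Node node) {
        List<String> links = new ArrayList<>();
        collect(node, links);
        return links;
    }

    private static void collect(Node node, List<String> out) {
        if (node instanceof Element) {
            Element e = (Element) node;
            if ("a".equalsIgnoreCase(e.getTagName()) && e.hasAttribute("href")) {
                out.add(e.getAttribute("href"));
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div><a href=\"http://example.com/nav\">nav</a></div>"
                + "<col_centre>main text <a href=\"http://example.com/in\">in</a></col_centre>"
                + "</body></html>";
        Document doc = parse(page);

        // Indexable text: only the <col_centre> subtree.
        String text = findSubtree(doc, "col_centre").getTextContent().trim();

        // Links: walked from the document root, so links outside <col_centre> are kept.
        List<String> links = collectLinks(doc.getDocumentElement());

        System.out.println(text);
        System.out.println(links);
    }
}
```

In this sketch, passing the root to collectLinks returns both links, while passing the <col_centre> node returns only the inner one. Whether something comparable happens inside the modified HtmlParser is not something the thread confirms.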

