I am sorry, forgive my ignorance. I found the answer :) Thanks for your time.
On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <[email protected]> wrote:

> Hi,
>
> Just wondering, what does dumpText mean in the ParseChecker?
>
> On the same grounds, in case I am writing a custom filter that extends
> HtmlParseFilter, do I have to make any configuration changes for Nutch?
>
> Thanks,
> Abi
>
>
> On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <[email protected]> wrote:
>
>> I'm not really sure, but I believe you must overwrite the already parsed
>> data yourself in your filter.
>>
>> On Tuesday 01 February 2011 18:54:32 a a wrote:
>> > Thx for your reply :)
>> >
>> > So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going to
>> > overwrite the ParseResult variable of the original parse-html plugin?
>> >
>> > Is it not going to spend more time doing the same operation twice,
>> > extracting the HTML source code of each URL to parse it (the first time
>> > in the original parse-html plugin and the second time in my new plugin)?
>> >
>> > thx a lot
>> >
>> > mehdi
>> >
>> > > From: [email protected]
>> > > To: [email protected]
>> > > Subject: Re: parse-html plugin
>> > > Date: Tue, 1 Feb 2011 18:42:51 +0100
>> > > CC: [email protected]
>> > >
>> > > Oh, I forgot. You could extend org.apache.nutch.parse.HtmlParseFilter.
>> > > Then you can retrieve whatever you need and store it in the
>> > > ParseResult object.
>> > >
>> > > On Tuesday 01 February 2011 15:25:20 a a wrote:
>> > > > hi,
>> > > >
>> > > > Is my question so difficult?
>> > > > Does no one have an idea?
>> > > >
>> > > > thx
>> > > >
>> > > > mehdi
>> > > >
>> > > > > From: [email protected]
>> > > > > To: [email protected]
>> > > > > Subject: RE: parse-html plugin
>> > > > > Date: Mon, 31 Jan 2011 16:05:22 +0000
>> > > > >
>> > > > > Hi All,
>> > > > >
>> > > > > any idea?
>> > > > >
>> > > > > mehdi
>> > > > >
>> > > > > > From: [email protected]
>> > > > > > To: [email protected]
>> > > > > > Subject: parse-html plugin
>> > > > > > Date: Thu, 27 Jan 2011 18:58:36 +0000
>> > > > > >
>> > > > > > hi,
>> > > > > > In the class HtmlParser I changed the 'text' variable to index
>> > > > > > only a part of my HTML page, and since I did, I lost a lot of
>> > > > > > outlinks!
>> > > > > >
>> > > > > > ...
>> > > > > >
>> > > > > > utils.getText(sb, extractIndexableContent(root)); // added on 26-01-2011 to extract only text inside <col_centre>
>> > > > > >
>> > > > > > // utils.getText(sb, root); // extract text --- disabled on 26-01-2011
>> > > > > >
>> > > > > > text = sb.toString();
>> > > > > >
>> > > > > > ...
>> > > > > >
>> > > > > > I believed that outlinks are not obtained from the text variable?!
>> > > > > > In the same class we can see how outlinks are extracted:
>> > > > > >
>> > > > > > ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
>> > > > > >
>> > > > > > URL baseTag = utils.getBase(root);
>> > > > > > if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
>> > > > > > utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
>> > > > > > outlinks = l.toArray(new Outlink[l.size()]);
>> > > > > >
>> > > > > > Can you please tell me what I did wrong?
>> > > > > >
>> > > > > > mehdi
>>
>> --
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350
>>

