If I'm not mistaken, it's not a plugin but an extension point. Maybe it doesn't need configuration, only inclusion on the classpath?
> i'm realy confused :)
>
> so in my nutch-site.xml i have to call my new plugin after parse-html one,
> like this
>
> parse-(text|html|msword|pdf|MY_NEW_HtmlParsefilter_PLUGIN)
>
> how about parse-text? it has also a parseresult object as the parse-html ?
> which one is used ?
>
> thx
>
> mehdi
>
> > Date: Wed, 2 Feb 2011 13:31:40 +0800
> > Subject: Re: parse-html plugin
> > From: [email protected]
> > To: [email protected]
> >
> > Hi,
> >
> > I am not sure if my guess would be right hopefully some one will have to
> > correct me if I am a wrong, I am just a beginner.
> >
> > I believe you would be implementing your own HtmlParseFilter as a plug-in
> > in which case the order in which the plug-in is executed has a call on
> > impact. I see some implementation on ordered filters in the
> > HtmlParseFilters class. If my assumption on this is correct, you may
> > want to order it as per your requirements.
> >
> > However, I am not really sure what determines the order or whether it will
> > take double(more) time for phase by phase filtering. Even I am looking
> > out for an answer to this :)
> >
> > Thanks,
> > Abi
> >
> > On Wed, Feb 2, 2011 at 11:28 AM, a a <[email protected]> wrote:
> > > i want to know if some one did this job before , mabe he could tell us
> > > if it will take more time (double time) when using another
> > > HtmlParsefilter to overwrite the original ParseResult object
> > > produced by the parse-html plugin.
> > >
> > > thx
> > >
> > > mehdi
> > >
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Subject: Re: parse-html plugin
> > > > Date: Wed, 2 Feb 2011 02:46:47 +0100
> > > > CC: [email protected]
> > > >
> > > > Oh well, please come back with your experience and results on this
> > > > issue in this thread. More users will benefit =)
> > > >
> > > > > I am sorry, forgive my ignorance. I got the answer for it :) Thanks
> > > > > for your time
> > > > >
> > > > > On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <[email protected]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Just wondering what does the dumpText mean in the ParseChecker?
> > > > > >
> > > > > > On the same grounds, incase I am writing a custom filter that
> > > > > > extends the HtmlParseFilter..do I have to make any configuration
> > > > > > changes for nutch?
> > > > > >
> > > > > > Thanks,
> > > > > > Abi
> > > > > >
> > > > > > On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <[email protected]> wrote:
> > > > > >> I'm not really sure but i believe you must overwrite the already
> > > > > >> parsed data yourself in your filter.
> > > > > >>
> > > > > >> On Tuesday 01 February 2011 18:54:32 a a wrote:
> > > > > >> > Thx for your reply :)
> > > > > >> >
> > > > > >> > so if i extend the org.apache.nutch.parse.HtmlParsefilter is it
> > > > > >> > going to overwrite to ParseResult varaible of the original plugin
> > > > > >> > parser-html ?
> > > > > >> >
> > > > > >> > is it not going to spend more time doing twice the operation of
> > > > > >> > extracting the html source code of each url to parse it (first
> > > > > >> > time the original parse-html plugin and the seconde time my new
> > > > > >> > plugin ) ??
> > > > > >> >
> > > > > >> > thx a lot
> > > > > >> >
> > > > > >> > mehdi
> > > > > >> >
> > > > > >> > > From: [email protected]
> > > > > >> > > To: [email protected]
> > > > > >> > > Subject: Re: parse-html plugin
> > > > > >> > > Date: Tue, 1 Feb 2011 18:42:51 +0100
> > > > > >> > > CC: [email protected]
> > > > > >> > >
> > > > > >> > > Oh, i forgot. You could extend
> > > > > >> > > org.apache.nutch.parse.HtmlParsefilter. Then you can
> > > > > >> > > retrieve whatever you need and store it in the ParseResult
> > > > > >> > > object.
> > > > > >> > >
> > > > > >> > > On Tuesday 01 February 2011 15:25:20 a a wrote:
> > > > > >> > > > hi,
> > > > > >> > > >
> > > > > >> > > > is my question so difficult ?
> > > > > >> > > > no one have an idea ?
> > > > > >> > > >
> > > > > >> > > > thx
> > > > > >> > > >
> > > > > >> > > > mehdi
> > > > > >> > > >
> > > > > >> > > > > From: [email protected]
> > > > > >> > > > > To: [email protected]
> > > > > >> > > > > Subject: RE: parse-html plugin
> > > > > >> > > > > Date: Mon, 31 Jan 2011 16:05:22 +0000
> > > > > >> > > > >
> > > > > >> > > > > Hi All,
> > > > > >> > > > >
> > > > > >> > > > > any idea ?
> > > > > >> > > > >
> > > > > >> > > > > mehdi
> > > > > >> > > > >
> > > > > >> > > > > > From: [email protected]
> > > > > >> > > > > > To: [email protected]
> > > > > >> > > > > > Subject: parse-html plugin
> > > > > >> > > > > > Date: Thu, 27 Jan 2011 18:58:36 +0000
> > > > > >> > > > > >
> > > > > >> > > > > > hi,
> > > > > >> > > > > > In the class HtmlParser I changed the 'text' variable to index only
> > > > > >> > > > > > a part of my html page, and since i did lost lot off outlinks !
> > > > > >> > > > > >
> > > > > >> > > > > > ...
> > > > > >> > > > > >
> > > > > >> > > > > > utils.getText(sb,extractIndexableContent(root)); //added on 26-01-2011 to extract only text inside <col_centre>
> > > > > >> > > > > >
> > > > > >> > > > > > // utils.getText(sb, root); // extract text --- disabled on 26-01-2011-
> > > > > >> > > > > >
> > > > > >> > > > > > text = sb.toString();
> > > > > >> > > > > >
> > > > > >> > > > > > ...
> > > > > >> > > > > >
> > > > > >> > > > > > i beleived that outlinks are not obtained from the text variable ?!
> > > > > >> > > > > >
> > > > > >> > > > > > in the same class we could see how outlinks are extracted !
> > > > > >> > > > > >
> > > > > >> > > > > > ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
> > > > > >> > > > > >
> > > > > >> > > > > > URL baseTag = utils.getBase(root);
> > > > > >> > > > > > if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> > > > > >> > > > > > utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> > > > > >> > > > > > outlinks = l.toArray(new Outlink[l.size()]);
> > > > > >> > > > > >
> > > > > >> > > > > > can you plz tell me what i did wrong.
> > > > > >> > > > > >
> > > > > >> > > > > > mehdi
> > > > > >>
> > > > > >> --
> > > > > >> Markus Jelsma - CTO - Openindex
> > > > > >> http://www.linkedin.com/in/markus17
> > > > > >> 050-8536620 / 06-50258350
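Coming back to the original question at the bottom of the thread: in the quoted HtmlParser code the text is built from a subtree while the outlinks are collected from `root`, so the two are independent as long as `root` itself is left untouched. Here is a minimal sketch of that idea using only the JDK DOM classes, not the Nutch API. Everything in it is an assumption for illustration: the class `SubtreeTextDemo`, the helper `indexableSubtree` (a hypothetical stand-in for `extractIndexableContent`, which is not shown in the thread), and the sample markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustration only: plain JDK DOM, no Nutch classes involved.
class SubtreeTextDemo {

    // Hypothetical stand-in for extractIndexableContent(root): returns the
    // first <col_centre> descendant without modifying the tree, or the root
    // itself when no such element exists.
    static Node indexableSubtree(Element root) {
        NodeList hits = root.getElementsByTagName("col_centre");
        return hits.getLength() > 0 ? hits.item(0) : root;
    }

    // Appends the trimmed text nodes below 'node', separated by single spaces.
    static void collectText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(t);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, sb);
        }
    }

    // Collects the href of every <a> element below 'node', in document order.
    static void collectLinks(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) out.add(href);
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectLinks(c, out);
        }
    }

    // Parses a well-formed XHTML string and returns its root element.
    static Element parse(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parse("<html><body><a href=\"/nav\">Nav</a>"
                + "<col_centre>Main <a href=\"/story\">story</a></col_centre>"
                + "</body></html>");

        // Text comes from the selected subtree only.
        StringBuilder sb = new StringBuilder();
        collectText(indexableSubtree(root), sb);

        // Links are still gathered from the whole, unmodified tree.
        List<String> links = new ArrayList<>();
        collectLinks(root, links);

        System.out.println(sb);    // Main story
        System.out.println(links); // [/nav, /story]
    }
}
```

One thing that might be worth checking, and this is only a guess since the body of `extractIndexableContent` is not in the thread, is whether that method detaches or rewrites nodes of `root` rather than just returning a reference into it. If it does, a later traversal of `root` for outlinks would see fewer links.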

