Re: parse-html plugin

Markus Jelsma Wed, 02 Feb 2011 13:17:07 -0800

If I'm not mistaken, it's not a plugin but an extension point. Maybe it doesn't
need configuration, only inclusion on the classpath?
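
On the "double time" worry further down the thread: as far as I understand it, the HTML parser builds the DOM once and hands that same tree to each filter, so a filter that replaces the text should not trigger a second parse. Below is a rough, self-contained sketch of that idea only. It is not Nutch code: ParsedPage, Result, PageFilter, CENTRE_ONLY and run() are hypothetical stand-ins, and the regex "parser" is a toy. It also shows that outlinks collected from the whole page survive when only the text is narrowed to <col_centre>.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterChainSketch {

    /** Hypothetical stand-in for the DOM the HTML parser builds once per page. */
    static final class ParsedPage {
        final String fullText;
        final String centreText;
        final List<String> links;

        ParsedPage(String fullText, String centreText, List<String> links) {
            this.fullText = fullText;
            this.centreText = centreText;
            this.links = links;
        }
    }

    /** Hypothetical stand-in for the parse result that a filter may replace. */
    static final class Result {
        final String text;
        final List<String> outlinks;

        Result(String text, List<String> outlinks) {
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    /** Hypothetical filter contract: works on the already-parsed page, never on raw HTML. */
    interface PageFilter {
        Result filter(Result current, ParsedPage page);
    }

    /** Counts how often the (toy) parser runs, to show filters add no extra parse. */
    static int parseCount = 0;

    /** Example filter: keep only the <col_centre> text, leave the outlinks untouched. */
    static final PageFilter CENTRE_ONLY =
        (current, page) -> new Result(page.centreText, current.outlinks);

    static String stripTags(String s) {
        return s.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Toy parser: collects href values and the text inside <col_centre>. */
    static ParsedPage parse(String html) {
        parseCount++;
        List<String> links = new ArrayList<>();
        Matcher href = Pattern.compile("href=\"([^\"]*)\"").matcher(html);
        while (href.find()) {
            links.add(href.group(1));
        }
        Matcher centre = Pattern
            .compile("<col_centre>(.*?)</col_centre>", Pattern.DOTALL)
            .matcher(html);
        String centreText = centre.find() ? stripTags(centre.group(1)) : "";
        return new ParsedPage(stripTags(html), centreText, links);
    }

    /** Parse once, then let each filter in order rewrite the result. */
    static Result run(String html, List<PageFilter> filters) {
        ParsedPage page = parse(html);
        Result result = new Result(page.fullText, page.links);
        for (PageFilter f : filters) {
            result = f.filter(result, page);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/menu\">Menu</a>"
            + "<col_centre>Main <a href=\"/more\">story</a></col_centre>";
        Result r = run(html, Collections.singletonList(CENTRE_ONLY));
        System.out.println(r.text);      // Main story
        System.out.println(r.outlinks);  // [/menu, /more]
        System.out.println(parseCount);  // 1
    }
}
```

In this sketch the filter only swaps the text field, so both links are kept and the parse counter stays at one. Whether the real filter chain behaves the same way is worth checking against the HtmlParser source for your Nutch version.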


> I'm really confused :)
> 
> So in my nutch-site.xml I have to call my new plugin after the parse-html one,
> like this:
> 
> parse-(text|html|msword|pdf|MY_NEW_HtmlParsefilter _PLUGIN)
> 
> How about parse-text? Does it also have a ParseResult object like parse-html?
> Which one is used?
> 
> thx
> 
> 
> mehdi
> 
> > Date: Wed, 2 Feb 2011 13:31:40 +0800
> > Subject: Re: parse-html plugin
> > From: [email protected]
> > To: [email protected]
> > 
> > Hi,
> > 
> >  I am not sure my guess is right; hopefully someone will correct me if
> > I am wrong, I am just a beginner.
> > 
> >  I believe you would be implementing your own HtmlParseFilter as a
> > plug-in, in which case the order in which the plug-ins are executed has
> > an impact. I see some implementation of ordered filters in the
> > HtmlParseFilters class. If my assumption on this is correct, you may
> > want to order it as per your requirements.
> > 
> >  However, I am not really sure what determines the order, or whether
> > phase-by-phase filtering will take double (or more) the time. I am also
> > looking for an answer to this :)
> > 
> > Thanks,
> > Abi
> > 
> > On Wed, Feb 2, 2011 at 11:28 AM, a a <[email protected]> wrote:
> > > I want to know if someone has done this before; maybe they could tell us
> > > whether it takes more time (double the time) to use another
> > > HtmlParseFilter to overwrite the original ParseResult object
> > > produced by the parse-html plugin.
> > > 
> > > thx
> > > 
> > > 
> > > mehdi
> > > 
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Subject: Re: parse-html plugin
> > > > Date: Wed, 2 Feb 2011 02:46:47 +0100
> > > > CC: [email protected]
> > > > 
> > > > Oh well, please come back with your experience and results on this
> > > > issue in this thread. More users will benefit =)
> > > > 
> > > > > I am sorry, forgive my ignorance. I got the answer for it :) Thanks
> > > > > for your time
> > > > > 
> > > > > On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <[email protected]> wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > >  Just wondering, what does dumpText mean in the ParseChecker?
> > > > > > >  
> > > > > > >  On the same grounds, in case I am writing a custom filter that
> > > > > > > extends the HtmlParseFilter, do I have to make any configuration
> > > > > > > changes for Nutch?
> > > > > > > 
> > > > > > Thanks,
> > > > > > Abi
> > > > > > 
> > > > > > > On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <[email protected]> wrote:
> > > > > > >> I'm not really sure, but I believe you must overwrite the already
> > > > > > >> parsed data yourself in your filter.
> > > > > >> 
> > > > > >> On Tuesday 01 February 2011 18:54:32 a a wrote:
> > > > > > >> > Thanks for your reply :)
> > > > > > >> > 
> > > > > > >> > So if I extend org.apache.nutch.parse.HtmlParseFilter, is it
> > > > > > >> > going to overwrite the ParseResult variable of the original
> > > > > > >> > parse-html plugin?
> > > > > > >> > 
> > > > > > >> > Isn't it going to spend more time by extracting the HTML
> > > > > > >> > source of each URL twice in order to parse it (the first time
> > > > > > >> > in the original parse-html plugin and the second time in my
> > > > > > >> > new plugin)?
> > > > > > >> > 
> > > > > > >> > Thanks a lot
> > > > > >> > 
> > > > > >> > mehdi
> > > > > >> > 
> > > > > >> > > From: [email protected]
> > > > > >> > > To: [email protected]
> > > > > >> > > Subject: Re: parse-html plugin
> > > > > >> > > Date: Tue, 1 Feb 2011 18:42:51 +0100
> > > > > >> > > CC: [email protected]
> > > > > >> > > 
> > > > > > >> > > Oh, I forgot. You could extend
> > > > > > >> > > org.apache.nutch.parse.HtmlParseFilter. Then you can
> > > > > > >> > > retrieve whatever you need and store it in the ParseResult
> > > > > > >> > > object.
> > > > > >> > > 
> > > > > >> > > On Tuesday 01 February 2011 15:25:20 a a wrote:
> > > > > > >> > > > Hi,
> > > > > > >> > > > 
> > > > > > >> > > > Is my question so difficult?
> > > > > > >> > > > Does no one have an idea?
> > > > > >> > > > 
> > > > > >> > > > thx
> > > > > >> > > > 
> > > > > >> > > > 
> > > > > >> > > > mehdi
> > > > > >> > > > 
> > > > > >> > > > > From: [email protected]
> > > > > >> > > > > To: [email protected]
> > > > > >> > > > > Subject: RE: parse-html plugin
> > > > > >> > > > > Date: Mon, 31 Jan 2011 16:05:22 +0000
> > > > > >> > > > > 
> > > > > >> > > > > 
> > > > > >> > > > > Hi All,
> > > > > >> > > > > 
> > > > > >> > > > > any  idea ?
> > > > > >> > > > > 
> > > > > >> > > > > 
> > > > > >> > > > > 
> > > > > >> > > > > mehdi
> > > > > >> > > > > 
> > > > > >> > > > > > From: [email protected]
> > > > > >> > > > > > To: [email protected]
> > > > > >> > > > > > Subject: parse-html plugin
> > > > > >> > > > > > Date: Thu, 27 Jan 2011 18:58:36 +0000
> > > > > >> > > > > > 
> > > > > >> > > > > > 
> > > > > >> > > > > > hi,
> > > > > > >> > > > > > In the class HtmlParser I changed the 'text' variable
> > > > > > >> > > > > > to index only a part of my HTML page, and since then
> > > > > > >> > > > > > I have lost a lot of outlinks!
> > > > > >> > > > > > 
> > > > > >> > > > > > ...
> > > > > >> > > > > > 
> > > > > > >> > > > > >       // added on 26-01-2011 to extract only text inside <col_centre>
> > > > > > >> > > > > >       utils.getText(sb, extractIndexableContent(root));
> > > > > > >> > > > > >       // utils.getText(sb, root);   // extract text --- disabled on 26-01-2011
> > > > > > >> > > > > >       text = sb.toString();
> > > > > >> > > > > > 
> > > > > >> > > > > > ...
> > > > > >> > > > > > 
> > > > > > >> > > > > > I believed that outlinks are not obtained from the
> > > > > > >> > > > > > text variable?!
> > > > > > >> > > > > > 
> > > > > > >> > > > > > In the same class we can see how outlinks are
> > > > > > >> > > > > > extracted:
> > > > > > >> > > > > > 
> > > > > > >> > > > > >       // extract outlinks
> > > > > > >> > > > > >       ArrayList<Outlink> l = new ArrayList<Outlink>();
> > > > > > >> > > > > >       URL baseTag = utils.getBase(root);
> > > > > > >> > > > > >       if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> > > > > > >> > > > > >       utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> > > > > > >> > > > > >       outlinks = l.toArray(new Outlink[l.size()]);
> > > > > > >> > > > > > 
> > > > > > >> > > > > > Can you please tell me what I did wrong?
> > > > > >> > > > > > 
> > > > > >> > > > > > 
> > > > > >> > > > > > mehdi
> > > > > >> 
> > > > > >> --
> > > > > >> Markus Jelsma - CTO - Openindex
> > > > > >> http://www.linkedin.com/in/markus17
> > > > > >> 050-8536620 / 06-50258350
