Re: parse-html plugin

.: Abhishek :. Tue, 01 Feb 2011 17:46:16 -0800

I am sorry, forgive my ignorance. I found the answer :) Thanks for your
time.


On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <[email protected]> wrote:

> Hi,
>
>  Just wondering, what does dumpText mean in the ParseChecker?
>
>  On the same grounds, in case I am writing a custom filter that extends
> HtmlParseFilter, do I have to make any configuration changes for Nutch?
>
> Thanks,
> Abi
>
>
> On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma 
> <[email protected]> wrote:
>
>> I'm not really sure, but I believe you must overwrite the already parsed
>> data yourself in your filter.
>>
>> On Tuesday 01 February 2011 18:54:32 a a wrote:
>> > Thx for your reply :)
>> >
>> > So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going to
>> > overwrite the ParseResult variable of the original parse-html plugin?
>> >
>> > Is it not going to spend more time doing the operation of extracting the
>> > HTML source code of each URL twice to parse it (the first time in the
>> > original parse-html plugin and the second time in my new plugin)?
>> >
>> > thx a lot
>> >
>> > mehdi
>> >
>> > > From: [email protected]
>> > > To: [email protected]
>> > > Subject: Re: parse-html plugin
>> > > Date: Tue, 1 Feb 2011 18:42:51 +0100
>> > > CC: [email protected]
>> > >
>> > > Oh, I forgot. You could extend org.apache.nutch.parse.HtmlParseFilter.
>> > > Then you can retrieve whatever you need and store it in the ParseResult
>> > > object.
>> > >
>> > > On Tuesday 01 February 2011 15:25:20 a a wrote:
>> > > > hi,
>> > > >
>> > > > Is my question so difficult?
>> > > > Does no one have an idea?
>> > > >
>> > > > thx
>> > > >
>> > > >
>> > > > mehdi
>> > > >
>> > > > > From: [email protected]
>> > > > > To: [email protected]
>> > > > > Subject: RE: parse-html plugin
>> > > > > Date: Mon, 31 Jan 2011 16:05:22 +0000
>> > > > >
>> > > > >
>> > > > > Hi All,
>> > > > >
>> > > > > any idea?
>> > > > >
>> > > > >
>> > > > >
>> > > > > mehdi
>> > > > >
>> > > > > > From: [email protected]
>> > > > > > To: [email protected]
>> > > > > > Subject: parse-html plugin
>> > > > > > Date: Thu, 27 Jan 2011 18:58:36 +0000
>> > > > > >
>> > > > > >
>> > > > > > hi,
>> > > > > > In the class HtmlParser I changed the 'text' variable to index only
>> > > > > > a part of my HTML page, and since then I have lost a lot of outlinks!
>> > > > > >
>> > > > > > ...
>> > > > > >
>> > > > > >       // added on 26-01-2011 to extract only text inside <col_centre>
>> > > > > >       utils.getText(sb, extractIndexableContent(root));
>> > > > > >
>> > > > > >       // utils.getText(sb, root);   // extract text --- disabled on 26-01-2011
>> > > > > >
>> > > > > >       text = sb.toString();
>> > > > > >
>> > > > > > ...
>> > > > > >
>> > > > > > I believed that outlinks are not obtained from the text variable?!
>> > > > > > In the same class we can see how outlinks are extracted:
>> > > > > >
>> > > > > >
>> > > > > >       // extract outlinks
>> > > > > >       ArrayList<Outlink> l = new ArrayList<Outlink>();
>> > > > > >       URL baseTag = utils.getBase(root);
>> > > > > >       if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
>> > > > > >       utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
>> > > > > >       outlinks = l.toArray(new Outlink[l.size()]);
>> > > > > >
>> > > > > > Can you please tell me what I did wrong?
>> > > > > >
>> > > > > >
>> > > > > > mehdi
>>
>> --
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350
>>
>
>
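A note on the original question, offered as a rough sketch rather than as Nutch code. In the snippet quoted in the thread, getText is given a subtree while getOutlinks is still given root, so the two walks are independent: narrowing the text to <col_centre> should not by itself change which links are collected. The example below illustrates that idea using only the JDK DOM classes. The names SubtreeTextDemo, findFirst, collectLinks, and parse are hypothetical, the input is assumed to be well-formed XHTML, and the helpers only stand in for Nutch's DOMContentUtils; they do not reproduce it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class: not part of Nutch.
public class SubtreeTextDemo {

    /** Parses a well-formed XHTML string into a DOM document (assumption: input is valid XML). */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns the first element with the given tag name, or null if there is none. */
    public static Element findFirst(Document doc, String tag) {
        NodeList hits = doc.getElementsByTagName(tag);
        return hits.getLength() > 0 ? (Element) hits.item(0) : null;
    }

    /** Appends the href of every <a> element at or below the given node, in document order. */
    public static List<String> collectLinks(Node node, List<String> out) {
        if (node instanceof Element) {
            Element e = (Element) node;
            if (e.getTagName().equalsIgnoreCase("a") && e.hasAttribute("href")) {
                out.add(e.getAttribute("href"));
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectLinks(c, out);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
            + "<div><a href=\"http://example.com/nav\">nav</a></div>"
            + "<col_centre>main text <a href=\"http://example.com/in\">in</a></col_centre>"
            + "</body></html>";
        Document doc = parse(page);

        // Text comes only from the <col_centre> subtree.
        Element centre = findFirst(doc, "col_centre");
        String text = centre == null ? "" : centre.getTextContent().trim();

        // Links are still collected from the whole document root.
        List<String> links = collectLinks(doc.getDocumentElement(), new ArrayList<>());

        System.out.println(text);
        System.out.println(links);
    }
}
```

If the link walk were also handed the <col_centre> element instead of the document root, the navigation link outside that element would be dropped, which is one way outlinks could go missing after a change like the one described.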
