Check the source code of the parse-html plugin.
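
A minimal sketch of the "dump the raw HTML" step, assuming you already have the page URL and the unparsed bytes in hand. As far as I know, in Nutch 1.x the parse-html plugin receives a Content object, and its getUrl() and getContent() methods would supply those two values, but please verify that against the version you are running. The class RawHtmlDumper and its methods fileNameFor and dump are hypothetical names for illustration only, not part of Nutch; the code uses only the Java standard library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Nutch): writes raw page bytes to a directory. */
class RawHtmlDumper {

    /** Turn a URL into a file name that is safe on common file systems. */
    static String fileNameFor(String url) {
        // Replace every character outside [A-Za-z0-9._-] with an underscore.
        String safe = url.replaceAll("[^A-Za-z0-9._-]", "_");
        // Cap the readable part so very long URLs still give usable names.
        if (safe.length() > 100) {
            safe = safe.substring(0, 100);
        }
        // A hash of the full URL makes name clashes between similar URLs unlikely.
        return safe + "-" + Integer.toHexString(url.hashCode()) + ".html";
    }

    /** Write the untouched bytes of one page into dir and return the file path. */
    static Path dump(Path dir, String url, byte[] rawContent) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileNameFor(url));
        return Files.write(target, rawContent);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; inside a plugin these would come from the fetched page.
        byte[] html = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        Path dir = Files.createTempDirectory("html-dump");
        Path written = dump(dir, "http://intranet.example/page?id=1", html);
        System.out.println("Wrote " + written);
    }
}
```

Calling dump(...) at the point where the plugin first sees the fetched content, before any parsing or tokenizing, would keep the dumped files identical to what was downloaded.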

On Sat, Dec 18, 2010 at 12:46 PM, Paul Lypaczewski <[email protected]> wrote:

> readseg works very well. Another question: how can I do this
> programmatically? I would like to insert a filter class that dumps the
> original HTML content before it is fed to the Tokenizer. Which part of the
> source code should I start reading?
>
>
> --- On Fri, 12/17/10, Paul Lypaczewski <[email protected]> wrote:
>
> From: Paul Lypaczewski <[email protected]>
> Subject: Re: How to dump the crawled Html pages?
> To: [email protected], [email protected]
> Received: Friday, December 17, 2010, 10:50 PM
>
> Hannes
> Thank you very much!
> Paul
>
> --- On Fri, 12/17/10, Hannes Carl Meyer <[email protected]> wrote:
>
> From: Hannes Carl Meyer <[email protected]>
> Subject: Re: How to dump the crawled Html pages?
> To: [email protected]
> Received: Friday, December 17, 2010, 7:37 PM
>
> Hi,
>
> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>
> Regards
>
> Hannes
>
> On Fri, Dec 17, 2010 at 8:32 PM, Paul Lypaczewski <[email protected]> wrote:
>
> > Thanks, Markus. I will check it out.
> >
> > --- On Fri, 12/17/10, Markus Jelsma <[email protected]> wrote:
> >
> > From: Markus Jelsma <[email protected]>
> > Subject: Re: How to dump the crawled Html pages?
> > To: [email protected]
> > Cc: "Paul Lypaczewski" <[email protected]>
> > Received: Friday, December 17, 2010, 7:25 PM
> >
> > Hi,
> >
> > Check out the readseg command.
> >
> > Cheers,
> >
> > > Hi
> > > I am new to Nutch. I just started using Nutch to crawl an intranet and
> > > extract a certain field from the HTML pages. The first step I would like
> > > to do is to dump all the HTML pages to a directory. I guess I should add a
> > > filter class to do it, but I have no idea where I should start. Can
> > > someone give me some advice on how to start, or which class's source code
> > > I should read? Thank you very much!
> > > Paul
> >
> >
> >
>
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> http://twitter.com/hannescarlmeyer
>
>
>
>
>


-- 
Ammar Shadiq
http://ammarshadiq.web.id
