Check the parse-html plugin source code.

On Sat, Dec 18, 2010 at 12:46 PM, Paul Lypaczewski <[email protected]> wrote:

> readseg works very well. Another question: how can I do it
> programmatically? I would like to insert a filter class that dumps
> the original HTML content before it is fed to the Tokenizer. Which
> part of the source code should I start reading?
>
> --- On Fri, 12/17/10, Paul Lypaczewski <[email protected]> wrote:
>
> From: Paul Lypaczewski <[email protected]>
> Subject: Re: How to dump the crawled Html pages?
> To: [email protected], [email protected]
> Received: Friday, December 17, 2010, 10:50 PM
>
> Hannes
> Thank you very much!
> Paul
>
> --- On Fri, 12/17/10, Hannes Carl Meyer <[email protected]> wrote:
>
> From: Hannes Carl Meyer <[email protected]>
> Subject: Re: How to dump the crawled Html pages?
> To: [email protected]
> Received: Friday, December 17, 2010, 7:37 PM
>
> Hi,
>
> for example:
> ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext
>
> Regards
>
> Hannes
>
> On Fri, Dec 17, 2010 at 8:32 PM, Paul Lypaczewski
> <[email protected]> wrote:
>
> > Thanks, Markus. I will check it out.
> >
> > --- On Fri, 12/17/10, Markus Jelsma <[email protected]> wrote:
> >
> > From: Markus Jelsma <[email protected]>
> > Subject: Re: How to dump the crawled Html pages?
> > To: [email protected]
> > Cc: "Paul Lypaczewski" <[email protected]>
> > Received: Friday, December 17, 2010, 7:25 PM
> >
> > Hi,
> >
> > Check out the readseg command.
> >
> > Cheers,
> >
> > > Hi
> > > I am new to Nutch. I just started to use Nutch to crawl an intranet
> > > and extract a certain field from the html pages. The first step I
> > > would like to do is to dump all the html pages to a directory. I
> > > guess I should add a filter class to do it, but I have no idea
> > > where I should start. Can someone give me some advice on how to
> > > start or which class's source code I should read? Thank you very
> > > much!
> > > Paul
>
> --
>
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> http://twitter.com/hannescarlmeyer

--
Ammar Shadiq
http://ammarshadiq.web.id
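
For the programmatic route, here is a rough sketch of the dumping part only. It is plain Java with no Nutch dependency, and the class name RawHtmlDumper and its methods are made up for illustration; they are not part of Nutch. The assumption is that whatever hook you choose (for example, somewhere in the parse-html plugin before the document is parsed) can hand you the page URL and the raw fetched bytes. Where exactly to call it is something you would need to confirm in the plugin source.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Hypothetical helper (not a Nutch class): writes the raw bytes of each
 * fetched page to its own file in a dump directory.
 */
public class RawHtmlDumper {
    private final Path dumpDir;

    public RawHtmlDumper(Path dumpDir) throws IOException {
        // Create the dump directory (and any missing parents) up front.
        this.dumpDir = Files.createDirectories(dumpDir);
    }

    /** Builds a filesystem-safe file name by hashing the URL with MD5. */
    public static String fileNameFor(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex + ".html";
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Writes the untouched page bytes and returns the path of the file. */
    public Path dump(String url, byte[] rawContent) throws IOException {
        Path target = dumpDir.resolve(fileNameFor(url));
        return Files.write(target, rawContent);
    }
}
```

Hashing the URL avoids trouble with slashes and query strings in file names, at the cost of names that are not human-readable; writing a small index file that maps hash to URL would be one way around that. Inside the plugin you would create one RawHtmlDumper and call dump(url, bytes) for each page before parsing starts.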

