Hi, for example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext
Regards,
Hannes

On Fri, Dec 17, 2010 at 8:32 PM, Paul Lypaczewski <[email protected]> wrote:

> Thanks, Markus. I will check it out.
>
> --- On Fri, 12/17/10, Markus Jelsma <[email protected]> wrote:
>
> From: Markus Jelsma <[email protected]>
> Subject: Re: How to dump the crawled Html pages?
> To: [email protected]
> Cc: "Paul Lypaczewski" <[email protected]>
> Received: Friday, December 17, 2010, 7:25 PM
>
> Hi,
>
> Check out the readseg command.
>
> Cheers,
>
> > Hi
> >
> > I am new to Nutch. I just started to use Nutch to crawl an intranet and
> > extract a certain field from the html pages. The first step I would like
> > to do is to dump all the html pages to a directory. I guess I should add a
> > filter class to do it, but I have no idea where I should start. Can
> > someone give me some advice on how to start or which class's source code
> > I should read? Thank you very much!
> >
> > Paul

--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
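If it helps, the one-segment command above can be looped over every segment in a crawl. This is only a sketch, not an official Nutch script: the crawl/segments layout, the NUTCH_HOME variable, and the dump_* output naming are assumptions you would adapt to your own setup. The -no* flags suppress everything except the raw fetched content, which is what you want when extracting fields from the HTML.

```shell
#!/bin/sh
# Hedged sketch: dump the raw content of every segment under crawl/segments.
# Assumes NUTCH_HOME points at your Nutch installation and that the crawl
# was made with the standard 1.x directory layout (crawl/segments/<timestamp>).
NUTCH_HOME=${NUTCH_HOME:-.}

for seg in crawl/segments/*; do
  [ -d "$seg" ] || continue   # skip anything that is not a segment directory
  out="dump_$(basename "$seg")"
  # Disable everything except the content part, so only the fetched
  # HTML pages end up in the dump.
  "$NUTCH_HOME/bin/nutch" readseg -dump "$seg" "$out" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
done
```

Each output directory then contains a plain-text dump you can grep or post-process to pull out the field you need.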

