The best method is to read or dump the contents of your crawldb and work
from that.

Please have a look at the wiki for how to use the readdb tool.
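As a rough sketch, the crawldb can be inspected with the readdb tool, and the fetched page content can be dumped out of a segment with readseg. The directory names below (crawl/crawldb, crawl/segments/...) are assumptions about a default crawl layout, not paths from the original mail:

```shell
# Print summary statistics for the crawldb (URL counts by fetch status).
bin/nutch readdb crawl/crawldb -stats

# Dump the crawldb entries as plain text for inspection.
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# Dump one fetched segment, including the raw fetched content, to text.
# Replace the timestamped directory with one of your own segments.
bin/nutch readseg -dump crawl/segments/20120122103000 segment-dump
```

The readseg dump interleaves crawl metadata with the raw content of each page, so the HTML itself would still need to be split out of the dump file afterwards.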

On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
[email protected]> wrote:

> Hi,
> I am using Nutch to generate a small dataset of the web, on which I am
> planning to run a focused crawler later.
>
> I did a test crawl and I have the 'segments' folder built up. Now I need
> to get the exact HTML pages it fetched from the seed URL(s).
>
> Is it possible to create a dataset this way? If so, how do I get those HTML
> pages?
>
> Thanks a lot!
>



-- 
*Lewis*
