Hi Klemens,

You should run ./bin/nutch readseg!

For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
-nofetch -nogenerate -noparse -noparsedata -noparsetext
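
If you would rather read the segment programmatically, as you ask below,
something along these lines should work. This is just a minimal sketch
assuming Nutch 1.x and the Hadoop 0.20-era API; the class name and the
segment path are only examples, so adjust them to your crawl:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpSegmentContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // part-00000 is a MapFile directory containing the data/index files
    String seqFile = "crawl/segments/20101122071139/content/part-00000";

    MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
    Text key = new Text();          // the page URL
    Content value = new Content();  // the raw fetched content
    while (reader.next(key, value)) {
      System.out.println("URL: " + key);
      System.out.println(new String(value.getContent())); // raw HTML bytes
    }
    reader.close();
  }
}

The readseg -dump command does essentially the same thing under the hood,
so the command above is usually the easier route.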

Kind Regards from Hannover

Hannes

On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <
[email protected]> wrote:

> Hi,
>
> I did a small crawl of some pages on the web and want to get the raw HTML
> content of these pages now. Reading the documentation in the wiki, I guess
> this content might be somewhere under
> crawl/segments/20101122071139/content/part-00000.
>
> I also guess I can access this content using the Hadoop API as described
> here: http://wiki.apache.org/nutch/Getting_Started
>
> However, I have absolutely no idea how to configure:
>
> MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
>
>
> The Hadoop documentation is not very helpful either. Could someone please
> point me in the right direction to get the page content?
>
> Thank you and regards
>    Klemens Muthmann
>
