Hi,
Super, that works, thank you. That also led me to the class that shows
how to achieve this from Java code:
org.apache.nutch.segment.SegmentReader.
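For the archives: a minimal sketch of driving SegmentReader from Java with
the same arguments the readseg command takes. The segment path and output
directory are just the examples from this thread, and it assumes the Nutch
and Hadoop jars are on the classpath:

    import org.apache.nutch.segment.SegmentReader;

    public class DumpSegment {
      public static void main(String[] args) throws Exception {
        // Same flags as ./bin/nutch readseg -dump; keeps only the raw content.
        SegmentReader.main(new String[] {
            "-dump", "crawl/segments/20101122071139", "dump_folder",
            "-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext"
        });
      }
    }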
Thanks again and bye
Klemens
On 22.11.2010 10:49, Hannes Carl Meyer wrote:
Hi Klemens,
you should run ./bin/nutch readseg!
For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
-nofetch -nogenerate -noparse -noparsedata -noparsetext
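(If I remember correctly, this writes a plain-text file named "dump" into
dump_folder; the -no* flags suppress everything except the raw content of
each fetched URL.)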
Kind Regards from Hannover
Hannes
On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann
<[email protected]> wrote:
Hi,
I did a small crawl of some pages on the web and now want to get the raw HTML
content of these pages. Reading the documentation in the wiki, I guess
this content might be somewhere under
crawl/segments/20101122071139/content/part-00000.
I also guess I can access this content using the Hadoop API, as described
here: http://wiki.apache.org/nutch/Getting_Started
However, I have absolutely no idea how to configure:
MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
The Hadoop documentation is not very helpful either. Could someone please
point me in the right direction to get the page content?
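For later readers: a minimal sketch of how such a reader might be pointed at
one map file of a segment's content directory. The path and the use of
NutchConfiguration are assumptions based on a typical Nutch 1.x setup, not a
verified recipe:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class ReadContent {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // One map-file directory inside the segment's content dir.
        String part = "crawl/segments/20101122071139/content/part-00000";
        MapFile.Reader reader = new MapFile.Reader(fs, part, conf);
        Text url = new Text();           // key: the page URL
        Content content = new Content(); // value: the fetched content
        while (reader.next(url, content)) {
          // content.getContent() holds the raw bytes, e.g. the HTML.
          System.out.println(url + ": " + content.getContent().length + " bytes");
        }
        reader.close();
      }
    }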
Thank you and regards
Klemens Muthmann
--
--------------------------------
Dipl.-Medieninf., Klemens Muthmann
Research Associate
Technische Universität Dresden
Fakultät Informatik
Institut für Systemarchitektur
Lehrstuhl Rechnernetze
01062 Dresden
Tel.: +49 (351) 463-38214
Fax: +49 (351) 463-38251
E-Mail: [email protected]
--------------------------------