Hi,
Super, that works, thank you. That also led me to the class that shows
how to achieve this from Java code:
org.apache.nutch.segment.SegmentReader.
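For the archives: a minimal sketch of driving SegmentReader from Java with
the same arguments the readseg command takes. The segment path and output
directory are just the examples from this thread, and it assumes the Nutch
and Hadoop jars are on the classpath:

    import org.apache.nutch.segment.SegmentReader;

    public class DumpSegment {
      public static void main(String[] args) throws Exception {
        // Same flags as ./bin/nutch readseg -dump; keeps only the raw content.
        SegmentReader.main(new String[] {
            "-dump", "crawl/segments/20101122071139", "dump_folder",
            "-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext"
        });
      }
    }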
Thanks again and bye
Klemens
On 22.11.2010 10:49, Hannes Carl Meyer wrote:
Hi Klemens,
you should run ./bin/nutch readseg!
For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
-nofetch -nogenerate -noparse -noparsedata -noparsetext
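(If I remember correctly, this writes a plain-text file named "dump" into
dump_folder; the -no* flags suppress everything except the raw content of
each fetched URL.)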
Kind Regards from Hannover
Hannes
On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann
<[email protected]> wrote:
Hi,
I did a small crawl of some pages on the web and now want to get the raw HTML
content of these pages. Reading the documentation in the wiki, I guess
this content might be somewhere under
crawl/segments/20101122071139/content/part-00000.
I also guess I can access this content using the Hadoop API, as described
here: http://wiki.apache.org/nutch/Getting_Started
However, I have absolutely no idea how to configure:
MapFile.Reader reader = new MapFile.Reader(fs, seqFile, conf);
The Hadoop documentation is not very helpful either. Could someone please
point me in the right direction to get the page content?
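For later readers: a minimal sketch of how such a reader might be pointed at
one map file of a segment's content directory. The path and the use of
NutchConfiguration are assumptions based on a typical Nutch 1.x setup, not a
verified recipe:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class ReadContent {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // One map-file directory inside the segment's content dir.
        String part = "crawl/segments/20101122071139/content/part-00000";
        MapFile.Reader reader = new MapFile.Reader(fs, part, conf);
        Text url = new Text();           // key: the page URL
        Content content = new Content(); // value: the fetched content
        while (reader.next(url, content)) {
          // content.getContent() holds the raw bytes, e.g. the HTML.
          System.out.println(url + ": " + content.getContent().length + " bytes");
        }
        reader.close();
      }
    }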
Thank you and regards
Klemens Muthmann
--
--------------------------------
Dipl.-Medieninf., Klemens Muthmann
Research Associate
Technische Universität Dresden
Fakultät Informatik
Institut für Systemarchitektur
Lehrstuhl Rechnernetze
01062 Dresden
Tel.: +49 (351) 463-38214
Fax: +49 (351) 463-38251
E-Mail: [email protected]
--------------------------------