Re: How to get page content of crawled pages

Jorge Luis Betancourt Gonzalez Tue, 29 Jan 2013 12:29:09 -0800

I suppose you could write a custom indexer that stores the data in MongoDB
instead of Solr. I think there is an open repo on GitHub for this.
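
If a full custom indexer is more than you need, another route is to parse the
text dump produced by the readseg command quoted below and push each record
into MongoDB yourself. The following is only a rough sketch, not something
taken from Nutch: the class SegmentDumpParser and its Page type are
hypothetical names, and the dump layout it expects (a "Recno::" line, a
"URL::" line, then the raw page after a line reading exactly "Content:") is an
assumption you should check against your own dump file. It uses only the JDK,
so the actual MongoDB insert is left as a comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

// Sketch: split a `readseg -dump` text file into (url, content) pairs.
// ASSUMED dump layout (verify against your own dump):
//   Recno:: 0
//   URL:: http://example.com/
//   Content::
//   ...header lines...
//   Content:
//   <raw page text until the next "Recno::" line>
class SegmentDumpParser {

    // Hypothetical holder type, not part of Nutch.
    static final class Page {
        final String url;
        final String content;
        Page(String url, String content) { this.url = url; this.content = content; }
    }

    static List<Page> parse(Iterator<String> lines) {
        List<Page> pages = new ArrayList<>();
        String url = null;
        StringBuilder body = null; // non-null once the "Content:" marker was seen
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith("Recno:: ")) {
                // A new record starts: flush the previous one, if complete.
                if (url != null && body != null) pages.add(new Page(url, body.toString().trim()));
                url = null;
                body = null;
            } else if (body == null && line.startsWith("URL:: ")) {
                url = line.substring("URL:: ".length()).trim();
            } else if (body == null && line.equals("Content:")) {
                body = new StringBuilder();
            } else if (body != null) {
                body.append(line).append('\n');
            }
        }
        if (url != null && body != null) pages.add(new Page(url, body.toString().trim()));
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the text file written by readseg -dump
        try (Stream<String> lines = Files.lines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            for (Page p : parse(lines.iterator())) {
                // Replace this line with an insert into MongoDB (or any other store).
                System.out.println(p.url + "\t" + p.content.length() + " chars");
            }
        }
    }
}
```

One known limitation: a page whose own text contains a line starting with
"Recno:: " would be split in two, and pages that are not valid UTF-8 will make
the reader fail, so treat this as a starting point only.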


----- Original Message -----
From: "peterbarretto" <peterbarrett...@gmail.com>
To: user@nutch.apache.org
Sent: Tuesday, 29 January 2013 8:46:04
Subject: Re: How to get page content of crawled pages

Hi

Is there a way I can dump the URL and the page content into MongoDB?


Klemens Muthmann wrote
> Hi,
>
> Super. That works. Thank you. I thereby also found the class that shows
> how to achieve this within Java code, which is
> org.apache.nutch.segment.SegmentReader.
>
> Thanks again and bye
>      Klemens
>
> On 22.11.2010 10:49, Hannes Carl Meyer wrote:
>> Hi Klemens,
>>
>> you should run ./bin/nutch readseg!
>>
>> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
>> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>>
>> Kind Regards from Hannover
>>
>> Hannes
>>
>> On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@> wrote:
>>
>>> Hi,
>>>
>>> I did a small crawl of some pages on the web and now want to get the raw
>>> HTML content of these pages. Reading the documentation in the wiki, I
>>> guess this content might be somewhere under
>>> crawl/segments/20101122071139/content/part-00000.
>>>
>>> I also guess I can access this content using the Hadoop API, as
>>> described here: http://wiki.apache.org/nutch/Getting_Started
>>>
>>> However, I have absolutely no idea how to configure:
>>>
>>> MapFile.Reader reader = new MapFile.Reader (fs, seqFile, conf);
>>>
>>>
>>> The Hadoop documentation is not very helpful either. Could someone please
>>> point me in the right direction to get the page content?
>>>
>>> Thank you and regards
>>>     Klemens Muthmann
>>>
>
>
> --
> --------------------------------
> Dipl.-Medieninf., Klemens Muthmann
> Wissenschaftlicher Mitarbeiter
>
> Technische Universität Dresden
> Fakultät Informatik
> Institut für Systemarchitektur
> Lehrstuhl Rechnernetze
> 01062 Dresden
> Tel.: +49 (351) 463-38214
> Fax: +49 (351) 463-38251
> E-Mail: klemens.muthmann@
> --------------------------------





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037023.html
Sent from the Nutch - User mailing list archive at Nabble.com.
