I suppose you could write a custom indexer that stores the data in MongoDB instead of Solr. I think there is an open repository on GitHub that does this.
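If a custom indexer is more than you need, a rougher route is to parse the text file written by the readseg -dump command quoted below and hand each page to whatever store you like. Here is a minimal sketch of the parsing half only. It assumes the dump marks each record with "Recno:: N" and "URL:: ..." lines and each section with a header such as "Content::", so please check that against your own dump first. The class name ReadsegDumpParser and its parse method are made up for this example and are not part of Nutch. The MongoDB insert is deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: splits a readseg text dump into
// URL -> content pairs. The marker strings below are assumptions about the
// dump layout; adjust them if your dump looks different.
final class ReadsegDumpParser {

    // Section headers that end a Content:: section (assumed names).
    private static final List<String> OTHER_SECTIONS =
            Arrays.asList("CrawlDatum::", "ParseData::", "ParseText::");

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> pages = new LinkedHashMap<>();
        String url = null;
        StringBuilder content = new StringBuilder();
        boolean inContent = false;

        for (String line : lines) {
            if (line.startsWith("Recno:: ")) {
                // A new record begins: store the previous one, if any.
                if (url != null) {
                    pages.put(url, content.toString().trim());
                }
                url = null;
                content.setLength(0);
                inContent = false;
            } else if (line.startsWith("URL:: ")) {
                url = line.substring("URL:: ".length()).trim();
            } else if (line.equals("Content::")) {
                inContent = true;
            } else if (OTHER_SECTIONS.contains(line)) {
                inContent = false;
            } else if (inContent) {
                // Everything inside the Content:: section is kept verbatim.
                content.append(line).append('\n');
            }
        }
        // Store the last record of the file.
        if (url != null) {
            pages.put(url, content.toString().trim());
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadsegDumpParser <dump-file>");
            return;
        }
        // ISO-8859-1 maps every byte to a character, so mixed page encodings
        // do not abort the read; this is a simplification.
        List<String> lines =
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> page : parse(lines).entrySet()) {
            System.out.println(page.getKey() + " (" + page.getValue().length() + " chars)");
        }
    }
}
```

The matching is a plain line-by-line heuristic, so a page whose HTML happens to contain a line starting with one of the markers would be split wrongly. Each entry of the returned map could then be written to MongoDB as one document with the driver of your choice.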

----- Original message -----
From: "peterbarretto" <peterbarrett...@gmail.com>
To: user@nutch.apache.org
Sent: Tuesday, 29 January 2013 8:46:04
Subject: Re: How to get page content of crawled pages

Hi,

Is there a way I can dump the URL and the URL content into MongoDB?

Klemens Muthmann wrote
> Hi,
>
> Super. That works. Thank you. I thereby also found the class that shows
> how to achieve this within Java code, which is
> org.apache.nutch.segment.SegmentReader.
>
> Thanks again and bye
> Klemens
>
> On 22.11.2010 10:49, Hannes Carl Meyer wrote:
>> Hi Klemens,
>>
>> you should run ./bin/nutch readseg!
>>
>> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
>> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>>
>> Kind Regards from Hannover
>>
>> Hannes
>>
>> On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@> wrote:
>>
>>> Hi,
>>>
>>> I did a small crawl of some pages on the web and now want to get the
>>> raw HTML content of these pages. Reading the documentation in the wiki,
>>> I guess this content might be somewhere under
>>> crawl/segments/20101122071139/content/part-00000.
>>>
>>> I also guess I can access this content using the Hadoop API as
>>> described here: http://wiki.apache.org/nutch/Getting_Started
>>>
>>> However, I have absolutely no idea how to configure:
>>>
>>> MapFile.Reader reader = new MapFile.Reader (fs, seqFile, conf);
>>>
>>> The Hadoop documentation is not very helpful either. Could someone
>>> please point me in the right direction to get the page content?
>>>
>>> Thank you and regards
>>> Klemens Muthmann
>
> --
> --------------------------------
> Dipl.-Medieninf., Klemens Muthmann
> Wissenschaftlicher Mitarbeiter
>
> Technische Universität Dresden
> Fakultät Informatik
> Institut für Systemarchitektur
> Lehrstuhl Rechnernetze
> 01062 Dresden
> Tel.: +49 (351) 463-38214
> Fax: +49 (351) 463-38251
> E-Mail: klemens.muthmann@
> --------------------------------

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037023.html
Sent from the Nutch - User mailing list archive at Nabble.com.