I tried the repo https://github.com/ctjmorgan/nutch-mongodb-indexer, but it does not work. I guess that is because it says it is for Nutch 1.3 and I am using 1.6.
I get the output below when I try to rebuild:

Buildfile: C:\nutch-16\build.xml
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-download-unchecked:

ivy-init-antlib:

ivy-init:

init:

clean-lib:
   [delete] Deleting directory C:\nutch-16\build\lib

resolve-default:
[ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = C:\nutch-16\ivy\ivysettings.xml
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

copy-libs:

compile-core:
    [javac] C:\nutch-16\build.xml:96: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 1 source file to C:\nutch-16\build\classes
    [javac] warning: [path] bad path element "C:\nutch-16\build\lib\activation.jar": no such file or directory
    [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
    [javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:7: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
    [javac] import org.apache.hadoop.mapred.JobConf;
    [javac] ^
    [javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18: error: MongodbWriter is not abstract and does not override abstract method delete(String) in NutchIndexWriter
    [javac] public class MongodbWriter implements NutchIndexWriter{
    [javac] ^
    [javac] C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:23: warning: [deprecation] JobConf in org.apache.hadoop.mapred has been deprecated
    [javac] public void open(JobConf job, String name) throws IOException {
    [javac] ^
    [javac] 1 error
    [javac] 4 warnings

I have already crawled some URLs, and I now need to move them to MongoDB.
Is there any easy-to-use code to do that? I am new to Java, so I will need all the steps for adding the code.

Jorge Luis Betancourt Gonzalez wrote
> I suppose you can write a custom indexer to store the data in MongoDB
> instead of Solr. I think there is an open repo on GitHub for this.
>
> ----- Original message -----
> From: "peterbarretto" <peterbarretto08@>
> To: user@.apache
> Sent: Tuesday, 29 January 2013 8:46:04
> Subject: Re: How to get page content of crawled pages
>
> Hi
>
> Is there a way I can dump the URL and URL content into MongoDB?
>
>
> Klemens Muthmann wrote
>> Hi,
>>
>> Super. That works. Thank you. I thereby also found the class that shows
>> how to achieve this within Java code, which is
>> org.apache.nutch.segment.SegmentReader.
>>
>> Thanks again and bye
>> Klemens
>>
>> On 22.11.2010 10:49, Hannes Carl Meyer wrote:
>>> Hi Klemens,
>>>
>>> you should run ./bin/nutch readseg!
>>>
>>> For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder
>>> -nofetch -nogenerate -noparse -noparsedata -noparsetext
>>>
>>> Kind Regards from Hannover
>>>
>>> Hannes
>>>
>>> On Mon, Nov 22, 2010 at 9:23 AM, Klemens Muthmann <klemens.muthmann@>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I did a small crawl of some pages on the web and now want to get the
>>>> raw HTML content of these pages. Reading the documentation in the wiki,
>>>> I guess this content might be somewhere under
>>>> crawl/segments/20101122071139/content/part-00000.
>>>>
>>>> I also guess I can access this content using the Hadoop API as
>>>> described here: http://wiki.apache.org/nutch/Getting_Started
>>>>
>>>> However, I have absolutely no idea how to configure:
>>>>
>>>> MapFile.Reader reader = new MapFile.Reader (fs, seqFile, conf);
>>>>
>>>>
>>>> The Hadoop documentation is not very helpful either. Could someone
>>>> please point me in the right direction to get the page content?
>>>>
>>>> Thank you and regards
>>>> Klemens Muthmann
>>>>
>>
>>
>> --
>> --------------------------------
>> Dipl.-Medieninf., Klemens Muthmann
>> Wissenschaftlicher Mitarbeiter
>>
>> Technische Universität Dresden
>> Fakultät Informatik
>> Institut für Systemarchitektur
>> Lehrstuhl Rechnernetze
>> 01062 Dresden
>> Tel.: +49 (351) 463-38214
>> Fax: +49 (351) 463-38251
>> E-Mail: klemens.muthmann@
>> --------------------------------
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037023.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037283.html
Sent from the Nutch - User mailing list archive at Nabble.com.
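
A note on the compile error in the build output above: javac says MongodbWriter "does not override abstract method delete(String) in NutchIndexWriter", which suggests that the NutchIndexWriter interface in Nutch 1.6 declares a delete(String) method that a writer written for 1.3 does not implement. What follows is only a minimal, self-contained Java sketch of that situation, not the real Nutch or MongoDB code. IndexWriterSketch is a hypothetical stand-in for NutchIndexWriter, and InMemoryWriter for MongodbWriter. Every signature except the delete(String) parameter list named in the error is an assumption, and a HashMap stands in for the MongoDB collection.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchIndexWriter. Only the delete(String)
// parameter list comes from the compiler message; the return types,
// the throws clauses and the other methods are assumptions.
interface IndexWriterSketch {
    void open(String name) throws IOException;
    void write(String url, String content) throws IOException;
    void delete(String key) throws IOException;
    void close() throws IOException;
}

// Hypothetical stand-in for MongodbWriter. A HashMap plays the role of
// the MongoDB collection so the sketch runs without any database.
class InMemoryWriter implements IndexWriterSketch {
    private final Map<String, String> docs = new HashMap<String, String>();
    private boolean open;

    @Override
    public void open(String name) {
        open = true;
    }

    @Override
    public void write(String url, String content) throws IOException {
        requireOpen();
        docs.put(url, content);
    }

    // Removing this method makes javac reject the class with an error of
    // the same kind as the one in the build output: a non-abstract class
    // must implement every method its interface declares.
    @Override
    public void delete(String key) throws IOException {
        requireOpen();
        docs.remove(key);
    }

    @Override
    public void close() {
        open = false;
    }

    int size() {
        return docs.size();
    }

    String get(String url) {
        return docs.get(url);
    }

    private void requireOpen() throws IOException {
        if (!open) {
            throw new IOException("writer is not open");
        }
    }
}

class DeleteMethodSketch {
    public static void main(String[] args) throws IOException {
        InMemoryWriter writer = new InMemoryWriter();
        writer.open("demo");
        writer.write("http://example.com/", "<html>hello</html>");
        writer.delete("http://example.com/");
        writer.close();
        System.out.println("documents left: " + writer.size());
    }
}
```

In the real MongodbWriter, the equivalent change would presumably be to add a delete(String) method that removes the matching document from the collection. The exact signature should be checked against the NutchIndexWriter source shipped with Nutch 1.6.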