Thanks for the patch, Lewis. Where do I make the pom.xml changes? I can't find the file.
Also, in 1.6 the command below returns the HTML content:

./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext

I haven't built the patch changes yet, as I can't find the pom.xml file.


lewis john mcgibbney wrote
> https://issues.apache.org/jira/browse/NUTCH-1528
>
> This is the mongodb indexer patch ported to trunk.
>
> Can I mention that there is usually no time line on these things e.g.
> feature requests.
> I'm sure you can appreciate that we are all extremely busy at work with an
> array of other things so if it takes a bit of time, then that's OK. The
> world goes on and keeps spinning. Even if we are getting bombarded by
> meteorites in Russia!!!
>
> Please check out the patch and comment accordingly.
>
> Regarding your issue with regards to the full page content, I am not sure
> if this is currently available in Nutch trunk without you writing some
> code.
> Full html markup is certainly stored in 2.x... but I don't know whether you
> are prepared to move to 2.x for your operations?
>
> hth
> Lewis
>
> On Fri, Feb 15, 2013 at 1:58 AM, peterbarretto <peterbarretto08@> wrote:
>
>> Hi Lewis,
>>
>> Is this patch done??
>>
>>
>> lewis john mcgibbney wrote
>> > Hi,
>> > Once I get access to my office I am going to build the patches from trunk.
>> > Is it trunk that you are using?
>> > Thanks
>> > Lewis
>> >
>> > On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto <peterbarretto08@> wrote:
>> >
>> >> Hi Lewis,
>> >>
>> >> I managed to get the code working by adding the below function to
>> >> MongodbWriter.java in the public class MongodbWriter implements
>> >> NutchIndexWriter :-
>> >>
>> >> public void delete(String key) throws IOException{
>> >> return;
>> >> }
>> >>
>> >> And the crawled data was getting stored in mongodb.
>> >> The only issue was it was storing only the text of the page and not the
>> >> full html content of the page.
>> >> How do I store the full html content of the page also?
>> >> Hope to see the patches soon.
>> >> Thanks
>> >>
>> >>
>> >> lewis john mcgibbney wrote
>> >> > Certainly.
>> >> > I am currently reviewing the code and will hopefully have patches for
>> >> > Nutch trunk cooked up for tomorrow.
>> >> > I'll update this thread likewise.
>> >> > Thanks
>> >> > Lewis
>> >> >
>> >> > On Wed, Jan 30, 2013 at 10:02 PM, peterbarretto <peterbarretto08@> wrote:
>> >> >> Hi Lewis,
>> >> >>
>> >> >> I am new to java and I don't know how to inherit all public methods from
>> >> >> NutchIndexWriter
>> >> >> Can you help me with that? Then I can rebuild and check if it works.
>> >> >>
>> >> >>
>> >> >> lewis john mcgibbney wrote
>> >> >>> As you will see the code has not been amended in a year or so.
>> >> >>> The positive side is that you only seem to be getting one issue with
>> >> >>> javac
>> >> >>>
>> >> >>> On Tue, Jan 29, 2013 at 8:39 PM, peterbarretto <peterbarretto08@> wrote:
>> >> >>>
>> >> >>>> C:\nutch-16\src\java\org\apache\nutch\indexer\mongodb\MongodbWriter.java:18:
>> >> >>>> error: MongodbWriter is not abstract and does not override abstract
>> >> >>>> method delete(String) in NutchIndexWriter
>> >> >>>> [javac] public class MongodbWriter implements NutchIndexWriter{
>> >> >>>>
>> >> >>> Sort this error out by inheriting all public methods from NutchIndexWriter
>> >> >>> for starts. I take it you are not developing from within Eclipse? As this
>> >> >>> would have been flagged up immediately. This should at least enable you to
>> >> >>> compile the code.
>> >> >>>
>> >> >>>> I have already crawled some urls now and I need to move those to mongodb.
>> >> >>>> Is there an easy to use code to do that?
>> >> >>>
>> >> >>> Not apart from hacking the code as you are already doing. The code you are
>> >> >>> pulling is not part of the official nutch codebase and to be honest a few
>> >> >>> of us didn't even know about it until you brought it to our attention :0)
>> >> >>>
>> >> >>> There is no silver bullet here, just take your time and we will get it
>> >> >>> working.
>> >> >>> Lewis
>> >> >>
>> >> >> --
>> >> >> View this message in context:
>> >> >> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4037621.html
>> >> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >> >
>> >> > --
>> >> > Lewis
>> >>
>> >> --
>> >> View this message in context:
>> >> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4039401.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>> > --
>> > *Lewis*
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040596.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> *Lewis*

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-get-page-content-of-crawled-pages-tp1944330p4040944.html
Sent from the Nutch - User mailing list archive at Nabble.com.
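
For anyone hitting the same javac error quoted above ("is not abstract and does not override abstract method delete(String)"), here is a minimal, self-contained sketch of why adding a delete(String) method makes the class compile. It does not use the real Nutch or MongoDB classes: IndexWriterSketch, InMemoryWriter and runDemo are hypothetical names made up for illustration, and the only method taken from this thread is delete(String key). A class that implements an interface must provide a body for every abstract method, even if that body does nothing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Self-contained sketch: none of these types are the real Nutch or MongoDB API.
class WriterSketch {

    // Hypothetical stand-in for NutchIndexWriter. Only delete(String) is
    // confirmed by the javac error in this thread; the rest is invented.
    interface IndexWriterSketch {
        void open(String jobName) throws IOException;
        void write(String key, String content) throws IOException;
        void delete(String key) throws IOException;
        void close() throws IOException;
    }

    // Hypothetical writer that stores documents in a map instead of MongoDB.
    static class InMemoryWriter implements IndexWriterSketch {
        private final Map<String, String> docs = new LinkedHashMap<>();

        @Override
        public void open(String jobName) throws IOException {
            // Nothing to connect to in this sketch.
        }

        @Override
        public void write(String key, String content) throws IOException {
            docs.put(key, content);
        }

        // Same no-op as in the thread: it satisfies the compiler,
        // but deletions are silently ignored.
        @Override
        public void delete(String key) throws IOException {
            return;
        }

        @Override
        public void close() throws IOException {
            // Nothing to release in this sketch.
        }

        int size() {
            return docs.size();
        }
    }

    // Writes one document, "deletes" it, and reports how many remain stored.
    static int runDemo() {
        InMemoryWriter writer = new InMemoryWriter();
        try {
            writer.open("demo");
            writer.write("http://example.com/", "<html>hello</html>");
            writer.delete("http://example.com/"); // ignored by the no-op
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return writer.size();
    }

    public static void main(String[] args) {
        System.out.println("documents stored: " + runDemo());
    }
}
```

Note the trade-off in the sketch: with a no-op delete, a document that is "deleted" stays in the store, so the demo still reports one stored document. That may be acceptable for a first import, but it is worth keeping in mind if stale pages matter.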