Hi Julien,

Thank you for your instructions. I finally got it working.

One more question, though: I want to read the URL filter regular expressions from Oracle instead of crawl-urlfilter.txt. I tried modifying org.apache.nutch.urlfilter.api.RegexURLFilter.java and URLFilters.java to load the rules from Oracle, but it didn't work. Any advice would be appreciated.
Thank you.

2010/5/25 Julien Nioche <[email protected]>

> Hi Eric
>
> You'll need to modify the class o.a.n.crawl.Injector in order to do that and
> replace the first map-reduce job in order to generate a sequencefile of
> crawldatum objects straight from Oracle. The second mapred job should work
> as is.
>
> J.
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> On 25 May 2010 13:46, eric park <[email protected]> wrote:
>
> > Hello guys,
> >
> > I'm trying to get rid of the url injection text file and read the starting
> > urls from oracle database.
> > It seems that nutch is integrated tightly with hadoop, and I cannot find the
> > way to modify this mechanism.
> > Anyone tried a similar modification?
> >
> > Thank you.
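
For reference, here is a minimal sketch of the idea behind the question above: loading "+regex" / "-regex" rules (the same line format as crawl-urlfilter.txt) from a database over plain JDBC and applying them with first-match-wins semantics. It is not Nutch code and does not touch RegexURLFilter or URLFilters. The class name DbRegexRules, the table url_filter_rules and its columns rule and rule_order are all hypothetical. Returning null to reject a URL is meant to mirror how Nutch URL filters signal rejection, as far as I understand it.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: URL filter rules in crawl-urlfilter.txt style, loaded from a database. */
public class DbRegexRules {

    /** One rule: accept ('+') or reject ('-') URLs in which the pattern is found. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines such as "+^http://example\.com/"; blank lines and '#' comments are skipped. */
    public static DbRegexRules fromLines(Iterable<String> lines) {
        DbRegexRules result = new DbRegexRules();
        for (String raw : lines) {
            if (raw == null) {
                continue; // a NULL column value is treated like a blank line
            }
            String line = raw.trim();
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with '+' or '-': " + line);
            }
            result.rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return result;
    }

    /** Reads one rule per row. The table and column names here are assumptions. */
    public static DbRegexRules fromDatabase(Connection conn) throws SQLException {
        List<String> lines = new ArrayList<>();
        String sql = "SELECT rule FROM url_filter_rules ORDER BY rule_order";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                lines.add(rs.getString(1));
            }
        }
        return fromLines(lines);
    }

    /** Returns the URL if the first matching rule accepts it, otherwise null. */
    public String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched: reject
    }
}
```

With an Oracle JDBC driver on the classpath, a Connection obtained from DriverManager.getConnection(...) could be passed to fromDatabase. Keeping the parsing in fromLines separate from the JDBC access makes the rule handling testable without a database.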

