Eric, I would suggest not modifying RegexURLFilter.java and URLFilters.java. Instead, change crawl-urlfilter.txt to pass all URLs (i.e. insert "+.") and write a custom plugin that runs after crawl-urlfilter and reads its rules from Oracle. You'll need to add your plugin to the plugin.includes property in nutch-site.xml/nutch-default.xml and to the urlfilter.order property. This way you can still stay in sync with future Nutch releases without considerable merge work.
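
A rough sketch of the filtering logic such a plugin could carry, in case it helps. It assumes the rules are stored in the same "+regex" / "-regex" form used in crawl-urlfilter.txt, that the first matching rule decides, and that a URL matching no rule is rejected (returned as null). The class name OracleRuleFilter, the table url_filter_rules and its columns rule and rule_order are all made up for illustration. A real plugin would also implement Nutch's URLFilter extension point and ship a plugin.xml, which is left out here so the sketch compiles on its own.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a custom Nutch URL filter plugin.
// It is not tied to any Nutch class so that it can be compiled alone.
class OracleRuleFilter {
    // Parallel lists: accept.get(i) says whether patterns.get(i) is a "+" rule.
    private final List<Boolean> accept = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Each rule is "+regex" (accept) or "-regex" (reject).
    // Null rows, blank lines and "#" comments are skipped.
    OracleRuleFilter(List<String> rules) {
        for (String rule : rules) {
            if (rule == null) continue;
            String r = rule.trim();
            if (r.isEmpty() || r.charAt(0) == '#') continue;
            char sign = r.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + r);
            }
            accept.add(sign == '+');
            patterns.add(Pattern.compile(r.substring(1)));
        }
    }

    // Reads the rules over plain JDBC. The table and column names are
    // assumptions; adjust them to the actual Oracle schema.
    static List<String> loadRules(Connection conn) throws SQLException {
        List<String> rules = new ArrayList<>();
        String sql = "SELECT rule FROM url_filter_rules ORDER BY rule_order";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rules.add(rs.getString(1));
            }
        }
        return rules;
    }

    // Returns the URL if the first matching rule is a "+" rule, otherwise null.
    String filter(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return accept.get(i) ? url : null;
            }
        }
        return null; // no rule matched: reject
    }
}
```

Loading the rules once, when the filter is set up, rather than on every filter() call keeps the database out of the per-URL path.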

Harry

On Wed, May 26, 2010 at 3:23 PM, eric park <[email protected]> wrote:

> Hi Julien
>
> Thank you for your instruction. I finally got it working. One more
> question, though.
> I want to read the url filter regular expressions from Oracle instead of
> crawl-urlfilter.txt.
> I tried to modify org.apache.nutch.urlfilter.api.RegexURLFilter.java and
> URLFilters.java to insert from oracle, but it didn't work.
> Any advice will be appreciated.
>
> Thank you.
>
> 2010/5/25 Julien Nioche <[email protected]>
>
> > Hi Eric
> >
> > You'll need to modify the class o.a.n.crawl.Injector in order to do that
> > and replace the first map-reduce job in order to generate a sequencefile
> > of crawldatum objects straight from Oracle. The second mapred job should
> > work as is.
> >
> > J.
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > On 25 May 2010 13:46, eric park <[email protected]> wrote:
> >
> > > Hello guys,
> > >
> > > I'm trying to get rid of the url injection text file and read the
> > > starting urls from oracle database.
> > > It seems that nutch is integrated tightly with hadoop, and I cannot
> > > find the way to modify this mechanism.
> > > Anyone tried a similar modification?
> > >
> > > Thank you.

