Hi Julien

Thank you for your instructions. I finally got it working. One more
question, though.
I want to read the URL filter regular expressions from Oracle instead of
crawl-urlfilter.txt.
I tried modifying org.apache.nutch.urlfilter.api.RegexURLFilter.java and
URLFilters.java to load the expressions from Oracle, but it didn't work.
Any advice would be appreciated.
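
To make the goal concrete, here is a minimal sketch of the idea, not working Nutch plugin code. It assumes the rules are stored one per row in the same "+regex" / "-regex" form used in the filter text file, and that the first matching rule decides, with unmatched URLs rejected (my understanding of the regex filter's behaviour). The class name DbRegexRules, the table url_filter_rules, and its columns rule and rule_order are all hypothetical; only standard JDBC and java.util.regex calls are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: first-match-wins regex rules loaded from a database instead of a text file. */
class DbRegexRules {

    /** One rule: accept (+) or reject (-) URLs in which the pattern is found. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses "+regex" / "-regex" lines; blank lines and "#" comments are skipped. */
    static DbRegexRules fromLines(List<String> lines) {
        DbRegexRules result = new DbRegexRules();
        for (String raw : lines) {
            String line = (raw == null) ? "" : raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            result.rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return result;
    }

    /** Reads the rule lines over JDBC. Table and column names are hypothetical. */
    static DbRegexRules fromDatabase(Connection conn) throws SQLException {
        List<String> lines = new ArrayList<>();
        String sql = "SELECT rule FROM url_filter_rules ORDER BY rule_order";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                lines.add(rs.getString(1));
            }
        }
        return fromLines(lines);
    }

    /** Returns the URL if the first matching rule accepts it, otherwise null. */
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched: reject
    }

    public static void main(String[] args) {
        // In-memory rules stand in for rows that fromDatabase() would return.
        DbRegexRules rules = fromLines(Arrays.asList(
                "# skip images", "-\\.(gif|jpg)$", "+^http://example\\.com/"));
        System.out.println(rules.filter("http://example.com/page.html"));
        System.out.println(rules.filter("http://example.com/logo.gif"));
    }
}
```

With an Oracle JDBC driver on the classpath, fromDatabase() would be given a Connection from DriverManager.getConnection(...). How to hook this into the Nutch filter plugin is exactly the part I have not solved.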

Thank you.



2010/5/25 Julien Nioche <[email protected]>

> Hi Eric
>
> You'll need to modify the class o.a.n.crawl.Injector in order to do that
> and replace the first map-reduce job in order to generate a SequenceFile of
> CrawlDatum objects straight from Oracle. The second mapred job should work
> as is.
>
> J.
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> On 25 May 2010 13:46, eric park <[email protected]> wrote:
>
> > Hello guys,
> >
> > I'm trying to get rid of the URL injection text file and read the
> > starting URLs from an Oracle database.
> > It seems that Nutch is tightly integrated with Hadoop, and I cannot find
> > a way to modify this mechanism.
> > Has anyone tried a similar modification?
> >
> > Thank you.
> >
>
