Eric,
I would suggest not modifying RegexURLFilter or URLFilters.java.
Instead, change crawl-urlfilter.txt to pass all URLs (i.e. insert +. )
and write a custom URL filter plugin that runs after the regex filter and
reads its rules from Oracle.
You'll need to add your plugin to the plugin.includes property in
nutch-site.xml/nutch-default.xml and to the urlfilter.order property.
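For reference, the registration could look roughly like this in nutch-site.xml. This is only a sketch: the plugin id urlfilter-oracle and the OracleURLFilter class name are hypothetical, and the plugin.includes value is abbreviated (keep whatever other plugins your install already lists).

```xml
<!-- Sketch only: "urlfilter-oracle" / OracleURLFilter are placeholder names -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|urlfilter-oracle|parse-(text|html)|index-basic</value>
</property>
<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.oracle.OracleURLFilter</value>
</property>
```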
This way you can still stay in sync with future Nutch releases without
considerable merge work.
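To make the idea concrete, here is a minimal standalone sketch of the filtering logic such a plugin could wrap. The class, table, and column names are hypothetical; in a real plugin the class would implement org.apache.nutch.net.URLFilter and take its JDBC settings from the Nutch configuration rather than method arguments.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of a URL filter that evaluates +/- regex rules (same syntax as
// crawl-urlfilter.txt) loaded from a database instead of a text file.
// In a real Nutch plugin this would implement org.apache.nutch.net.URLFilter.
public class DbRegexUrlFilter {

    // One parsed rule: sign ('+' accept, '-' reject) and its regex.
    private static class Rule {
        final char sign;
        final Pattern pattern;
        Rule(char sign, Pattern pattern) { this.sign = sign; this.pattern = pattern; }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parse rule lines such as "-\\.(gif|jpg)$" or "+."
    public DbRegexUrlFilter(List<String> ruleLines) {
        for (String line : ruleLines) {
            line = line.trim();
            if (line.isEmpty() || line.charAt(0) == '#') continue;
            rules.add(new Rule(line.charAt(0), Pattern.compile(line.substring(1))));
        }
    }

    // Hypothetical loader: assumes a table URL_FILTER_RULES with a RULE column.
    public static DbRegexUrlFilter fromOracle(String jdbcUrl, String user, String pass)
            throws Exception {
        List<String> lines = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(jdbcUrl, user, pass);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT rule FROM url_filter_rules ORDER BY rule_order")) {
            while (rs.next()) lines.add(rs.getString(1));
        }
        return new DbRegexUrlFilter(lines);
    }

    // Nutch URLFilter contract: return the URL to keep it, null to drop it.
    // First matching rule wins; no match means reject, like RegexURLFilter.
    public String filter(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.sign == '+' ? url : null;
            }
        }
        return null;
    }
}
```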

Harry




On Wed, May 26, 2010 at 3:23 PM, eric park <[email protected]> wrote:

> Hi Julien
>
> Thank you for your instructions. I finally got it working.  One more
> question, though.
> I want to read the URL filter regular expressions from Oracle instead of
> crawl-urlfilter.txt.
> I tried modifying org.apache.nutch.urlfilter.api.RegexURLFilter.java and
> URLFilters.java to load them from Oracle, but it didn't work.
> Any advice will be appreciated.
>
> Thank you.
>
>
>
> 2010/5/25 Julien Nioche <[email protected]>
>
> > Hi Eric
> >
> > You'll need to modify the class o.a.n.crawl.Injector to do that, and
> > replace the first map-reduce job so that it generates a SequenceFile of
> > CrawlDatum objects straight from Oracle. The second map-reduce job should
> > work as is.
> >
> > J.
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> > On 25 May 2010 13:46, eric park <[email protected]> wrote:
> >
> > > Hello guys,
> > >
> > > I'm trying to get rid of the URL injection text file and read the
> > > starting URLs from the Oracle database.
> > > It seems that Nutch is tightly integrated with Hadoop, and I cannot
> > > find a way to modify this mechanism.
> > > Anyone tried a similar modification?
> > >
> > > Thank you.
> > >
> >
>
