Re: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

S.L Mon, 08 Jul 2013 13:40:31 -0700

On second thought, I am also considering Solr instead of the MySQL DB.
You mentioned that I need to look into how to talk to DBs in Hadoop land;
what if I have to talk to Solr from Nutch?
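
To make the question concrete, here is a rough sketch of the kind of hook I
have in mind, so that the choice between MySQL and Solr stays behind one
interface. Everything in it is hypothetical: FetchedUrlRecorder, Entry and
Sink are names I made up, not Nutch classes, and the call from the fetcher's
output() method is assumed rather than taken from the Nutch source. It only
buffers URLs with some metadata and hands them to a backend in batches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; none of these names exist in Nutch. */
class FetchedUrlRecorder {

    /** One fetched URL plus the metadata to store with it (fields are illustrative). */
    static final class Entry {
        final String url;
        final int httpStatus;
        final long fetchTimeMillis;

        Entry(String url, int httpStatus, long fetchTimeMillis) {
            this.url = url;
            this.httpStatus = httpStatus;
            this.fetchTimeMillis = fetchTimeMillis;
        }
    }

    /** Backend abstraction: a MySQL (JDBC) or Solr client would implement this. */
    interface Sink {
        void write(List<Entry> batch);
    }

    private final Sink sink;
    private final int batchSize;
    private final List<Entry> buffer = new ArrayList<>();

    FetchedUrlRecorder(Sink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /**
     * Assumed to be called once per fetched URL, for example from output().
     * Synchronized because several fetcher threads may call it concurrently.
     */
    synchronized void record(String url, int httpStatus, long fetchTimeMillis) {
        buffer.add(new Entry(url, httpStatus, fetchTimeMillis));
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends buffered entries to the sink; call once more when fetching ends. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Clear only after the write returns, so a failed write keeps the entries.
        sink.write(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With this shape, switching from MySQL to Solr would only mean swapping the
Sink implementation; the fetcher-side call would stay the same.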



On Mon, Jul 8, 2013 at 4:34 PM, Markus Jelsma <[email protected]> wrote:

> Processing the logs would be easy, but since you need some metadata you'd
> probably need to hack into the Fetcher.java code. The fetcher has several
> inner classes, but you'd need the FetcherThread class, which is responsible
> for the actual download and anything else that needs to be done there. If
> you also need metadata that requires parsing the file, you need to configure
> the fetcher to do parsing as well.
>
>
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup
>
> The record is fetched around #683. The output() method writes the stuff to
> the segment and does the optional parsing of the record. Parsing is done
> around #960.
>
> In output() you could communicate with your DB; it's not the best place,
> but it is easy to test. FetcherOutputFormat is more suitable for writing
> data. Also read about how to talk to DBs in Hadoop land.
>
> Cheers
>
>
> -----Original message-----
> > From:S.L <[email protected]>
> > Sent: Monday 8th July 2013 22:11
> > To: [email protected]
> > Subject: Intercept the current URL that Nutch is about to crawl in Nutch 1.7
> >
> > Hello All,
> >
> > I am a new Nutch user. I need to be able to get every URL that Nutch
> > crawls within a session and insert the URL into a MySQL database along
> > with some other metadata. I am using Nutch 1.7 and have set up the project
> > in Eclipse. Can anyone please give me guidance on which class/classes I
> > would need to modify to get the URL in the current session and insert it
> > into the database?
> >
> > Any help would be greatly appreciated.
> >
> > Thank you in advance.
> >
>
