RE: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

Markus Jelsma Mon, 08 Jul 2013 13:36:31 -0700

Processing the logs would be easy, but since you need some metadata you will
probably need to hack into the Fetcher.java code. The fetcher has several inner
classes; the one you need is FetcherThread, which is responsible for the
actual download and anything else that needs to be done there. If you also
need metadata that requires parsing the file, you need to configure the fetcher
to do parsing as well (set fetcher.parse to true in nutch-site.xml).


http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup

The record is fetched around line 683. The output() method writes the fetched
record to the segment and optionally parses it; parsing is done around
line 960.

In output() you could communicate with your database; it's not the best place
for it, but it is easy to test. FetcherOutputFormat is more suitable for
writing data. Also read up on how to talk to databases from Hadoop.
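
As a rough illustration of the output() approach, here is a minimal sketch of a
helper that output() could hand each URL to. Everything in it is an assumption
for illustration and not part of Nutch: the class FetchedUrlRecorder, its Row,
Sink and JdbcSink types, and the fetched_urls table with its three columns. It
buffers rows and writes them in batches, so the fetcher threads do not make one
database round trip per URL.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): buffers fetched URLs and writes them in batches. */
public class FetchedUrlRecorder {

  /** One row per fetched URL; the chosen fields are illustrative only. */
  public static final class Row {
    public final String url;
    public final int httpStatus;
    public final long fetchTimeMillis;

    Row(String url, int httpStatus, long fetchTimeMillis) {
      this.url = url;
      this.httpStatus = httpStatus;
      this.fetchTimeMillis = fetchTimeMillis;
    }
  }

  /** Destination for a batch of rows, so the database can be replaced in tests. */
  public interface Sink {
    void write(List<Row> batch) throws Exception;
  }

  /** JDBC sink; assumes a table fetched_urls(url, http_status, fetched_at) already exists. */
  public static final class JdbcSink implements Sink {
    private final Connection conn;

    public JdbcSink(Connection conn) {
      this.conn = conn;
    }

    public void write(List<Row> batch) throws SQLException {
      PreparedStatement ps = conn.prepareStatement(
          "INSERT INTO fetched_urls (url, http_status, fetched_at) VALUES (?, ?, ?)");
      try {
        for (Row r : batch) {
          ps.setString(1, r.url);
          ps.setInt(2, r.httpStatus);
          ps.setTimestamp(3, new Timestamp(r.fetchTimeMillis));
          ps.addBatch();
        }
        ps.executeBatch(); // send the whole batch in one go
      } finally {
        ps.close();
      }
    }
  }

  private final Sink sink;
  private final int batchSize;
  private final List<Row> buffer = new ArrayList<Row>();

  public FetchedUrlRecorder(Sink sink, int batchSize) {
    this.sink = sink;
    this.batchSize = batchSize;
  }

  /** Synchronized because several fetcher threads would share one recorder. */
  public synchronized void record(String url, int httpStatus, long fetchTimeMillis)
      throws Exception {
    buffer.add(new Row(url, httpStatus, fetchTimeMillis));
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  /** Writes any buffered rows; the buffer is cleared only if the write succeeds. */
  public synchronized void flush() throws Exception {
    if (buffer.isEmpty()) {
      return;
    }
    sink.write(new ArrayList<Row>(buffer));
    buffer.clear();
  }
}
```

Under these assumptions, output() would call something like
recorder.record(url, status, System.currentTimeMillis()) and the fetcher would
call recorder.flush() once when it finishes. For MySQL the Connection would
come from DriverManager.getConnection(...) with the Connector/J driver on the
classpath. In a distributed run each map task would need its own connection,
which is one reason FetcherOutputFormat is the better long-term home for this.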

Cheers

 
-----Original message-----
> From:S.L <[email protected]>
> Sent: Monday 8th July 2013 22:11
> To: [email protected]
> Subject: Intercept the current URL that Nutch is about to crawl in Nutch 1.7
> 
> Hello All,
> 
> I am a new Nutch user. I need to be able to get every URL that Nutch is
> crawling within a session and insert the URL into a MySQL database along
> with some other metadata. I am using Nutch 1.7 and have set up the project
> in Eclipse. Can anyone please give me guidance on which class/classes I
> would need to modify to get the URL in the current session and insert it
> into the database?
> 
> Any help would be greatly appreciated.
> 
> Thank You in advance.
>
