On second thought, I am also considering Solr instead of the MySQL DB. You mentioned that I need to look into how to talk to DBs in Hadoop land; what if I have to talk to Solr from Nutch?

On Mon, Jul 8, 2013 at 4:34 PM, Markus Jelsma <[email protected]> wrote:

> Processing the logs would be easy, but since you need some metadata you
> probably need to hack into the Fetcher.java code. The fetcher has several
> inner classes, but you'd need the FetcherThread class, which is responsible
> for the actual download and anything else that needs to be done there. If
> you also need metadata that requires parsing the file, you need to configure
> the fetcher to do parsing as well.
>
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup
>
> The record is fetched around #683. The output() method writes the stuff to
> the segment and does the optional parsing of the record. Parsing is done
> around #960.
>
> In output() you could communicate with your DB; it's not the best place,
> but it is easy to test. FetcherOutputFormat is more suitable for writing
> data. Also read about how to talk to DBs in Hadoop land.
>
> Cheers
>
> -----Original message-----
> > From: S.L <[email protected]>
> > Sent: Monday 8th July 2013 22:11
> > To: [email protected]
> > Subject: Intercept the current URL that Nutch is about to crawl in Nutch 1.7
> >
> > Hello All,
> >
> > I am a new Nutch user. I need to be able to get every URL that Nutch is
> > crawling within a session and insert the URL into a MySQL database along
> > with some other metadata. I am using Nutch 1.7 and have set up the project
> > in Eclipse. Can anyone please give me guidance on which class/classes I
> > would need to modify to get the URL in the current session and insert it
> > into the database?
> >
> > Any help would be greatly appreciated.
> >
> > Thank you in advance.
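
For what it's worth, here is a rough sketch of how the hook in output() might be kept independent of the backend, so that switching between MySQL and Solr only changes one piece. Nothing here is Nutch or Solr API: UrlSink, BufferingUrlSink and FetchedRecord are hypothetical names made up for illustration, and the writer callback is a placeholder where a JDBC batch insert or a SolrJ client call would go. The sink is synchronized on the assumption that several fetcher threads would share it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class UrlSinkSketch {

    /** Hypothetical value object: one fetched URL plus the metadata to store. */
    static final class FetchedRecord {
        final String url;
        final int httpStatus;
        final long fetchTimeMillis;

        FetchedRecord(String url, int httpStatus, long fetchTimeMillis) {
            this.url = url;
            this.httpStatus = httpStatus;
            this.fetchTimeMillis = fetchTimeMillis;
        }
    }

    /** Hypothetical interface the fetcher code would call; hides MySQL vs. Solr. */
    interface UrlSink {
        void record(FetchedRecord r);
        void flush();
    }

    /** Collects records and hands them to the writer in batches. */
    static final class BufferingUrlSink implements UrlSink {
        private final int batchSize;
        private final Consumer<List<FetchedRecord>> writer;
        private final List<FetchedRecord> buffer = new ArrayList<>();

        BufferingUrlSink(int batchSize, Consumer<List<FetchedRecord>> writer) {
            this.batchSize = batchSize;
            this.writer = writer;
        }

        // synchronized: assumed to be shared by concurrent fetcher threads
        @Override
        public synchronized void record(FetchedRecord r) {
            buffer.add(r);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        @Override
        public synchronized void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            // Pass a copy so the writer never sees the buffer being cleared.
            writer.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        // Stand-in writer: a real one would issue a JDBC batch or a Solr update.
        List<List<FetchedRecord>> written = new ArrayList<>();
        BufferingUrlSink sink = new BufferingUrlSink(2, written::add);

        sink.record(new FetchedRecord("http://example.com/a", 200, 1L));
        sink.record(new FetchedRecord("http://example.com/b", 200, 2L));
        sink.record(new FetchedRecord("http://example.com/c", 404, 3L));
        sink.flush(); // push the final partial batch

        System.out.println(written.size() + " batches written"); // prints: 2 batches written
    }
}
```

The idea would be to call record() from wherever the URL becomes available in the fetcher and flush() once when the job finishes, so the final partial batch is not lost. Batching is a design choice here, not a requirement; it just avoids one round trip to the database or to Solr per URL.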

