Well, that would be easier indeed. Nutch will fetch, parse (+ optional custom
parse filters) and index (+ optional custom indexing filters) to any available
indexing backend (Solr, Elasticsearch). Check the Nutch tutorial on the wiki.
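
To give a rough idea of what a custom indexing filter does, here is a minimal, self-contained sketch. It does not use the real Nutch plugin API: the DocFilter interface, the ADD_CRAWL_METADATA filter and the "fetchTime"/"status" metadata keys are all made up for illustration, and a real filter would be written against Nutch's own indexing filter extension point and document classes instead of plain maps.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for an indexing-filter hook. NOT the real Nutch API.
class CrawlMetadataFilterSketch {

    /** Hypothetical hook: return the enriched document, or null to skip it. */
    interface DocFilter {
        Map<String, String> filter(Map<String, String> doc,
                                   String url,
                                   Map<String, String> fetchMetadata);
    }

    /** Copies the URL and selected fetch metadata into the outgoing document. */
    static final DocFilter ADD_CRAWL_METADATA = (doc, url, meta) -> {
        if (url == null || url.isEmpty()) {
            return null; // nothing useful to index without a URL
        }
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.put("url", url);
        // "fetchTime" and "status" are assumed keys, chosen for this example only.
        for (String key : new String[] {"fetchTime", "status"}) {
            String value = meta.get(key);
            if (value != null) {
                out.put("crawl_" + key, value);
            }
        }
        return out;
    };

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Example page");

        Map<String, String> meta = new HashMap<>();
        meta.put("status", "200");

        // The enriched map is what would be handed to the indexing backend.
        System.out.println(
            ADD_CRAWL_METADATA.filter(doc, "http://example.com/", meta));
    }
}
```

The point of doing it this way is that the URL and its metadata travel with the document to the backend, so nothing has to open a database connection from inside the fetcher.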
-----Original message-----
> From:S.L <[email protected]>
> Sent: Monday 8th July 2013 22:39
> To: [email protected]
> Subject: Re: Intercept the current URL that Nutch is about to crawl in Nutch
> 1.7
>
> On second thought, I am also considering Solr instead of the MySQL DB. You
> mentioned that I need to look into how to talk to DBs in Hadoop land; what
> if I have to talk to Solr from Nutch?
>
>
> On Mon, Jul 8, 2013 at 4:34 PM, Markus Jelsma
> <[email protected]> wrote:
>
> > Processing the logs would be easy, but since you need some metadata you
> > probably need to hack into the Fetcher.java code. The fetcher has several
> > inner classes but you'd need the FetcherThread class which is responsible
> > for the actual download and anything else that needs to be done there. If
> > you also need metadata that requires parsing the file you need to configure
> > the fetcher to do parsing as well.
> >
> >
> > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup
> >
> > The record is fetched around #683. The output() method writes the stuff to
> > the segment and does the optional parsing of the record. Parsing is done
> > around #960.
> >
> > In output() you could communicate with your DB. It's not the best place,
> > but it is easy to test. FetcherOutputFormat is more suitable for writing
> > data. Also read about how to talk to DBs in Hadoop land.
> >
> > Cheers
> >
> >
> > -----Original message-----
> > > From:S.L <[email protected]>
> > > Sent: Monday 8th July 2013 22:11
> > > To: [email protected]
> > > Subject: Intercept the current URL that Nutch is about to crawl in Nutch 1.7
> > >
> > > Hello All,
> > >
> > > I am a new Nutch user. I need to be able to get every URL that Nutch is
> > > crawling within a session and insert the URL into a MySQL database along
> > > with some other metadata. I am using Nutch 1.7 and have set up the project
> > > in Eclipse. Can anyone please give me guidance on which class/classes I
> > > would need to modify to get the URL in the current session and insert it
> > > into the database?
> > >
> > > Any help would be greatly appreciated.
> > >
> > > Thank you in advance.
> > >
> >
>