Re: Intercept the current URL that Nutch is about to crawl in Nutch 1.7

S.L Sun, 14 Jul 2013 13:55:04 -0700

Markus,

I need to modify the data, i.e. the content of the pages, before populating
the Solr index. If I use the solrIndex command to do that, which classes
would I need to change?


Alternatively, I was thinking of the approach from my original question, i.e.
to intercept the URL, download the content, extract the data, and update the
Solr schema from there.

Does one option have any advantage over the other?

Thanks.


On Mon, Jul 8, 2013 at 4:42 PM, Markus Jelsma <[email protected]> wrote:

> Well, that would be easier indeed. Nutch will fetch, parse (+ optional
> custom parse filters) and index (+ optional custom indexing filters) to any
> available indexing backend (Solr, ES). Check the Nutch tutorial on the wiki.
>
> -----Original message-----
> > From:S.L <[email protected]>
> > Sent: Monday 8th July 2013 22:39
> > To: [email protected]
> > Subject: Re: Intercept the current URL that Nutch is about to crawl in Nutch 1.7
> >
> > On second thought, I am also considering Solr instead of the MySQL DB.
> > You mentioned that I need to look into how to talk to DBs in Hadoop land;
> > what if I have to talk to Solr from Nutch?
> >
> >
> > On Mon, Jul 8, 2013 at 4:34 PM, Markus Jelsma <[email protected]> wrote:
> >
> > > Processing the logs would be easy, but since you need some metadata you
> > > probably need to hack into the Fetcher.java code. The fetcher has several
> > > inner classes, but you'd need the FetcherThread class, which is responsible
> > > for the actual download and anything else that needs to be done there. If
> > > you also need metadata that requires parsing the file, you need to
> > > configure the fetcher to do parsing as well.
> > >
> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup
> > >
> > > The record is fetched around #683. The output() method writes the stuff
> > > to the segment and does the optional parsing of the record. Parsing is
> > > done around #960.
> > >
> > > In output() you could communicate with your DB; it's not the best
> > > place, but it is easy to test there. FetcherOutputFormat is more suitable
> > > for writing data. Also read about how to talk to DBs in Hadoop land.
> > >
> > > Cheers
> > >
> > >
> > > -----Original message-----
> > > > From:S.L <[email protected]>
> > > > Sent: Monday 8th July 2013 22:11
> > > > To: [email protected]
> > > > Subject: Intercept the current URL that Nutch is about to crawl in Nutch 1.7
> > > >
> > > > Hello All,
> > > >
> > > > I am a new Nutch user. I need to get every URL that Nutch crawls
> > > > within a session and insert it into a MySQL database along with some
> > > > other metadata. I am using Nutch 1.7 and have set up the project in
> > > > Eclipse. Can anyone please give me guidance on which class or classes I
> > > > would need to modify to get the URL in the current session and insert
> > > > it into the database?
> > > >
> > > > Any help would be greatly appreciated.
> > > >
> > > > Thank You in advance.
> > > >
> > >
> >
>
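
For comparison, the route Markus describes above (fetch, parse, then run
optional custom indexing filters before the document reaches Solr) can be
sketched in plain Java. This is only an illustration under stated
assumptions: DocFilter, NormalizeContentFilter, runFilters and the
"source_url" field are hypothetical stand-ins, not Nutch classes. In Nutch
itself the corresponding extension point is the indexing filter plugin, and
its exact interface should be checked against the 1.7 sources.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingFilterSketch {

    /** Hypothetical stand-in for an indexing-filter extension point (not a Nutch type). */
    interface DocFilter {
        /** Returns the (possibly modified) document, or null to drop it in this sketch. */
        Map<String, String> filter(String url, Map<String, String> doc);
    }

    /** Example filter: collapses whitespace in "content", truncates it, and records the URL. */
    static final class NormalizeContentFilter implements DocFilter {
        private final int maxChars;

        NormalizeContentFilter(int maxChars) {
            this.maxChars = maxChars;
        }

        @Override
        public Map<String, String> filter(String url, Map<String, String> doc) {
            String content = doc.get("content");
            if (content == null) {
                return doc; // nothing to modify
            }
            String cleaned = content.trim().replaceAll("\\s+", " ");
            if (cleaned.length() > maxChars) {
                cleaned = cleaned.substring(0, maxChars);
            }
            doc.put("content", cleaned);
            doc.put("source_url", url); // hypothetical extra field
            return doc;
        }
    }

    /** Runs every filter in order; stops early if one of them drops the document. */
    static Map<String, String> runFilters(String url, Map<String, String> doc,
                                          List<DocFilter> filters) {
        for (DocFilter f : filters) {
            doc = f.filter(url, doc);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", "  Some   page \n text  ");
        Map<String, String> out = runFilters("http://example.com/", doc,
                List.of(new NormalizeContentFilter(100)));
        // At this point the document would be handed to the indexing backend.
        System.out.println(out);
    }
}
```

The point of the sketch is that the page content is changed after parsing and
just before indexing, so the fetcher code itself stays untouched.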
