Get Parent of URLs fetched by nutch

blunderboy Tue, 22 May 2012 03:41:11 -0700

I am running the Apache Nutch 1.4 crawler and want to store some additional
information: the parent of every URL, i.e. the page on which it was found.


For example, I want to crawl a page a.html that has 2 anchor links, to b.html
and c.html. So when I crawl a.html, I should get something like this:

a.html null
b.html a.html
c.html a.html
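
To make the desired output concrete, here is a minimal sketch in plain Java.
It is not Nutch code: the class ParentTracker, the method parents() and the
hand-built link map are hypothetical names made up for illustration, and the
sketch assumes the outlinks of each page are already known.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Nutch.
class ParentTracker {

    // Walks the link graph breadth-first from the seed and records, for each
    // URL, the page on which it was first discovered (null for the seed).
    static Map<String, String> parents(String seed, Map<String, List<String>> outlinks) {
        Map<String, String> parentOf = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parentOf.put(seed, null);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            for (String child : outlinks.getOrDefault(page, Collections.emptyList())) {
                // Keep only the first parent seen for each URL.
                if (!parentOf.containsKey(child)) {
                    parentOf.put(child, page);
                    queue.add(child);
                }
            }
        }
        return parentOf;
    }

    public static void main(String[] args) {
        // Hypothetical link map: a.html links to b.html and c.html.
        Map<String, List<String>> outlinks = new HashMap<>();
        outlinks.put("a.html", Arrays.asList("b.html", "c.html"));
        for (Map.Entry<String, String> e : parents("a.html", outlinks).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Running main prints the three lines shown above. A URL linked from several
pages keeps only the first parent found; storing all parents would need a
list per URL instead.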

This is the kind of mapping I want to store. I have read how Nutch works and
have run Nutch in Eclipse too. I also read Fetcher.java and added logging
where it fetches content, but I could not find where Nutch extracts the
child URLs of a given page. I think this step takes place after the parsing
step.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Get-Parent-of-URLs-fetched-by-nutch-tp3985369.html
Sent from the Nutch - User mailing list archive at Nabble.com.
