Hi Amit,

here is the answer for Nutch 1.7 (or are you using 2.x?):
Every URL is stored in CrawlDb, even with http.redirect.max = 10.
For redirects, the target URL is stored in the CrawlDatum's metadata
under the key _pst_ (protocol status):

http://issues.apache.org/jira/browse/NUTCH
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Sun Dec 15 20:38:53 CET 2013
Modified time: Fri Nov 15 20:38:53 CET 2013
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.00941915
Signature: null
Metadata: Content-Type=text/html
          _maxdepth_=1000
          _pst_=temp_moved(13), lastModified=0: https://issues.apache.org/jira/browse/NUTCH
          _depth_=2

Sebastian

On 11/14/2013 12:56 PM, Amit Sela wrote:
> Hi all,
>
> I'm reading the crawldb as CrawledPage and I see the fetched URL, content
> etc.
> In case of a redirection (I allow 10 redirections in nutch-site.xml) the
> fetched URL is not the original URL the Fetcher turned to, and I would like
> to get that as well.
>
> Does nutch store it somewhere, I'm basically looking for mapping between
> URLs attempted to fetch and actually fetched.
>
> Thanks,
>
> Amit.
>
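
PS: to build the "attempted URL -> actually fetched URL" mapping from such a record, the _pst_ entry can be split with a small script. The following is only a sketch in Python that works on the text of the record shown above. The regular expression assumes the value always looks like "<name>(<code>), lastModified=<n>: <target URL>", which is inferred from this single example and not from the Nutch sources; the function name redirect_target is made up for illustration.

```python
import re

# Sample record text, copied from the CrawlDb record quoted in this thread.
RECORD = """\
http://issues.apache.org/jira/browse/NUTCH
Status: 4 (db_redir_temp)
Metadata: Content-Type=text/html
          _maxdepth_=1000
          _pst_=temp_moved(13), lastModified=0: https://issues.apache.org/jira/browse/NUTCH
          _depth_=2
"""

# Assumed layout of the entry (inferred from the one example above):
#   _pst_=<status name>(<status code>), lastModified=<n>: <target URL>
PST_PATTERN = re.compile(r"_pst_=(\w+)\((\d+)\), lastModified=\d+: (\S+)")


def redirect_target(record_text):
    """Return (status_name, target_url) if a _pst_ entry with a target URL
    is present in the record text, otherwise None."""
    match = PST_PATTERN.search(record_text)
    if match is None:
        return None
    status_name, _status_code, target_url = match.groups()
    return status_name, target_url


if __name__ == "__main__":
    # The source URL is the first line of the record in this sample.
    source_url = RECORD.splitlines()[0]
    print(source_url, "->", redirect_target(RECORD))
```

Records without a redirect target (no URL after the _pst_ value) simply yield None, so they can be skipped when collecting the mapping.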

