Re: Get original URL from crawldb in case of redirect

Sebastian Nagel Fri, 15 Nov 2013 12:06:06 -0800

Hi Amit,

here is the answer for Nutch 1.7
(or are you using 2.x?):


Every URL, including the source of a redirect, is stored
in CrawlDb as its own entry, even with
  http.redirect.max = 10

For redirects, the target URL is stored in CrawlDatum's
metadata under key _pst_ (protocol status):

http://issues.apache.org/jira/browse/NUTCH      Version: 7
Status: 4 (db_redir_temp)
Fetch time: Sun Dec 15 20:38:53 CET 2013
Modified time: Fri Nov 15 20:38:53 CET 2013
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.00941915
Signature: null
Metadata:
        Content-Type=text/html
        _maxdepth_=1000
        _pst_=temp_moved(13), lastModified=0: https://issues.apache.org/jira/browse/NUTCH
        _depth_=2
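
To get the mapping from attempted URL to redirect target, one option is to
parse the text output of the CrawlDb reader. The following is only a minimal
sketch in Python, assuming the records look exactly like the one above (URL
followed by "Version:" on the first line, and a _pst_ entry ending in
": <target URL>"). The function name redirect_map and both regular expressions
are my own, not part of Nutch, and it is an assumption that every redirect
status name ends in "moved" as temp_moved does here.

```python
import re

# First line of a record as shown above: "<url><whitespace>Version: <n>".
RECORD_START = re.compile(r"^(\S+)\s+Version: \d+[ \t]*$", re.MULTILINE)

# "_pst_=temp_moved(13), lastModified=0: <target>". The \s* after the colon
# also accepts a target that was wrapped onto the next line.
PST = re.compile(r"_pst_=(\w+)\(\d+\), lastModified=-?\d+:\s*(\S+)")


def redirect_map(dump_text):
    """Return {source_url: (protocol_status_name, target_url)}.

    Only records whose _pst_ status name ends in "moved" are kept
    (assumption: this covers temporary and permanent redirects).
    """
    result = {}
    starts = list(RECORD_START.finditer(dump_text))
    for i, start in enumerate(starts):
        # A record runs until the next record's first line (or end of text).
        end = starts[i + 1].start() if i + 1 < len(starts) else len(dump_text)
        record = dump_text[start.end():end]
        pst = PST.search(record)
        if pst and pst.group(1).endswith("moved"):
            result[start.group(1)] = (pst.group(1), pst.group(2))
    return result
```

Called on the text of the record above, this should map
http://issues.apache.org/jira/browse/NUTCH to the https:// target. Reading
CrawlDatum metadata directly through the Java API avoids text parsing, but
that is not shown here.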

Sebastian

On 11/14/2013 12:56 PM, Amit Sela wrote:
> Hi all,
> 
> I'm reading the crawldb as CrawledPage and I see the fetched URL, content
> etc.
> In case of a redirection (I allow 10 redirections in nutch-site.xml) the
> fetched URL is not the original URL the Fetcher turned to, and I would like
> to get that as well.
> 
> Does Nutch store it somewhere? I'm basically looking for a mapping between
> URLs attempted to fetch and URLs actually fetched.
> 
> Thanks,
> 
> Amit.
>
