Would _pst_ exist in the metadata even if I'm crawling with db.update.additions.allowed=false?
(I have a use case where I don't really crawl but just fetch. Sometimes the list is too long for one execution, so I have to re-execute on the same CrawlDb, but I don't want to crawl outside the seed list.)

Thanks.

On Fri, Nov 15, 2013 at 10:05 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Amit,
>
> here is the answer for Nutch 1.7
> (or are you using 2.x?):
>
> Every URL is stored in CrawlDb even with
>   http.redirect.max = 10
>
> For redirects, the target URL is stored in CrawlDatum's
> metadata under key _pst_ (protocol status):
>
> http://issues.apache.org/jira/browse/NUTCH Version: 7
> Status: 4 (db_redir_temp)
> Fetch time: Sun Dec 15 20:38:53 CET 2013
> Modified time: Fri Nov 15 20:38:53 CET 2013
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 0.00941915
> Signature: null
> Metadata:
>   Content-Type=text/html
>   _maxdepth_=1000
>   _pst_=temp_moved(13), lastModified=0:
>     https://issues.apache.org/jira/browse/NUTCH
>   _depth_=2
>
> Sebastian
>
> On 11/14/2013 12:56 PM, Amit Sela wrote:
> > Hi all,
> >
> > I'm reading the crawldb as CrawledPage and I see the fetched URL, content,
> > etc.
> > In case of a redirection (I allow 10 redirections in nutch-site.xml) the
> > fetched URL is not the original URL the Fetcher turned to, and I would
> > like to get that as well.
> >
> > Does Nutch store it somewhere? I'm basically looking for a mapping between
> > URLs attempted to fetch and URLs actually fetched.
> >
> > Thanks,
> >
> > Amit.
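For reference, here is a rough sketch of how the redirect target could be pulled out of a _pst_ value like the one in the quoted dump. It only assumes the text layout shown in that dump ("_pst_=<status>(<code>), lastModified=<n>: <target URL>"); the function name redirect_target and the regular expression are made up for illustration and are not part of Nutch.

```python
import re

# Pattern based only on the sample dump quoted in this thread (an assumption,
# not an official Nutch format): "_pst_=<status>(<code>), lastModified=<n>"
# optionally followed by ": <target URL>", possibly wrapped onto the next line.
PST_RE = re.compile(
    r"_pst_=(?P<status>\w+)\((?P<code>\d+)\), lastModified=\d+"
    r"(?::\s*(?P<target>\S+))?"
)


def redirect_target(metadata_text):
    """Return the URL recorded after _pst_, or None if there is none."""
    match = PST_RE.search(metadata_text)
    if match is None:
        return None
    return match.group("target")


# Metadata block copied from the dump quoted in this thread.
sample = (
    "Content-Type=text/html\n"
    "_maxdepth_=1000\n"
    "_pst_=temp_moved(13), lastModified=0:\n"
    "https://issues.apache.org/jira/browse/NUTCH\n"
    "_depth_=2\n"
)

print(redirect_target(sample))
```

Pairing each CrawlDb URL with the value returned here would give the "attempted URL to redirect target" mapping asked about in the original question, at least for entries whose dump looks like the sample.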

