Re: Get original URL from crawldb in case of redirect

Amit Sela Sat, 16 Nov 2013 06:53:25 -0800

Would _pst_ exist in the metadata even if I'm crawling with
db.update.additions.allowed=false?


(I have a use case where I don't really crawl but just fetch. Sometimes the
list is too long for one execution, so I have to re-execute on the same
CrawlDb, but I don't want to crawl outside the seed list.)

Thanks.


On Fri, Nov 15, 2013 at 10:05 PM, Sebastian Nagel <
[email protected]> wrote:

> Hi Amit,
>
> here is the answer for Nutch 1.7
> (or are you using 2.x?):
>
> Every URL is stored in CrawlDb even with
>   http.redirect.max = 10
>
> For redirects, the target URL is stored in CrawlDatum's
> metadata under key _pst_ (protocol status):
>
> http://issues.apache.org/jira/browse/NUTCH      Version: 7
> Status: 4 (db_redir_temp)
> Fetch time: Sun Dec 15 20:38:53 CET 2013
> Modified time: Fri Nov 15 20:38:53 CET 2013
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 0.00941915
> Signature: null
> Metadata:
>         Content-Type=text/html
>         _maxdepth_=1000
>         _pst_=temp_moved(13), lastModified=0: https://issues.apache.org/jira/browse/NUTCH
>         _depth_=2
>
> Sebastian
>
> On 11/14/2013 12:56 PM, Amit Sela wrote:
> > Hi all,
> >
> > I'm reading the crawldb as CrawledPage and I see the fetched URL, content,
> > etc.
> > In case of a redirection (I allow 10 redirections in nutch-site.xml) the
> > fetched URL is not the original URL the Fetcher turned to, and I would like
> > to get that as well.
> >
> > Does Nutch store it somewhere? I'm basically looking for a mapping between
> > URLs attempted to fetch and URLs actually fetched.
> >
> > Thanks,
> >
> > Amit.
> >
>
>
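For anyone who wants to turn the record quoted above into an "original URL ->
redirect target" pair, here is a minimal sketch. It is not Nutch code: it only
parses the text layout shown in Sebastian's reply, and it assumes that the
_pst_ value has the shape "name(code), lastModified=N: target" and that
redirects are reported under the names "moved" and "temp_moved". The function
name redirect_from_record and the SAMPLE text are made up for illustration.

```python
import re

# Matches the _pst_ metadata entry in the layout quoted in this thread, e.g.
#   _pst_=temp_moved(13), lastModified=0: https://example.org/target
# The trailing ": <target>" part is optional because non-redirect statuses
# may carry no argument.
PST_RE = re.compile(
    r"_pst_=(?P<name>\w+)\((?P<code>\d+)\), lastModified=-?\d+"
    r"(?::\s*(?P<target>\S+))?"
)

# Assumption: status names that indicate a redirect in the dumped text.
REDIRECT_NAMES = {"moved", "temp_moved"}


def redirect_from_record(record_text):
    """Return (original_url, target_url) for a redirect record, else None."""
    lines = [ln.strip() for ln in record_text.strip().splitlines()]
    if not lines:
        return None
    # The first line starts with the record's own URL, followed by "Version: N".
    original_url = lines[0].split()[0]
    # Join the lines so a target URL wrapped onto the next line still matches.
    match = PST_RE.search(" ".join(lines))
    if match is None or match.group("name") not in REDIRECT_NAMES:
        return None
    if match.group("target") is None:
        return None
    return original_url, match.group("target")


# Hypothetical sample, copied from the record quoted in this thread.
SAMPLE = """http://issues.apache.org/jira/browse/NUTCH\tVersion: 7
Status: 4 (db_redir_temp)
Metadata:
        Content-Type=text/html
        _maxdepth_=1000
        _pst_=temp_moved(13), lastModified=0: https://issues.apache.org/jira/browse/NUTCH
        _depth_=2
"""

if __name__ == "__main__":
    print(redirect_from_record(SAMPLE))
```

Joining the lines before matching is a deliberate choice: mail clients (as in
the quote above) may wrap the target URL onto its own line, and the sketch
should not depend on that.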
