Re: Getting original URL for redirect

Mark Achee Wed, 04 May 2011 13:55:38 -0700

This is backwards from what you want, but it may help.  Look up the original
URL in the crawldb:

bin/nutch readdb output/crawldb -url 'http://example.org/original/url/'

Replace "output" with the name of your crawl output directory.  If the URL
was redirected, the "Metadata" field will say "moved" and show the target
URL.  If there were multiple redirects, repeat the lookup with each target
URL until you reach an entry that was not redirected.
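
Repeating the lookup by hand gets tedious for long chains, so it can be
scripted.  What follows is only a rough sketch, not a tested Nutch tool.  The
function names (lookup_url, redirect_target, follow_redirects) and the CRAWLDB
and MAX_HOPS variables are made up for illustration.  It also assumes that a
redirected entry prints a line containing "moved" that ends with ": " followed
by the target URL; check that against the readdb output of your Nutch version
before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: follow a chain of redirects recorded in the crawldb.
# ASSUMPTION: a redirected entry prints a line containing "moved" that ends
# with ": <target URL>". Verify this against your own readdb output.

CRAWLDB="${CRAWLDB:-output/crawldb}"

# Look up one URL in the crawldb (the same command as in the message above).
lookup_url() {
    bin/nutch readdb "$CRAWLDB" -url "$1"
}

# Read readdb output on stdin and print the redirect target, if there is one.
redirect_target() {
    grep 'moved' \
        | sed -n 's/.*: \(https\{0,1\}:\/\/[^[:space:]]*\)[[:space:]]*$/\1/p' \
        | head -n 1
}

# Print every hop, one URL per line, starting with the original URL.
# Stops when an entry has no redirect target or after MAX_HOPS lookups,
# which guards against redirect loops.
follow_redirects() {
    url="$1"
    hops=0
    while [ -n "$url" ] && [ "$hops" -lt "${MAX_HOPS:-10}" ]; do
        echo "$url"
        url=$(lookup_url "$url" | redirect_target)
        hops=$((hops + 1))
    done
}
```

After sourcing the file, running
follow_redirects 'http://example.org/original/url/'
from the Nutch directory prints the original URL followed by each redirect
target, so the last line is the URL the crawl data was stored under.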

-Mark

On Thu, Apr 21, 2011 at 5:23 PM, Chris Woolum <[email protected]> wrote:

> Hey Everyone,
>
>
> I am doing some crawling in which I need to match my crawl data back up
> to my original URL set, but the problem is that in the case of a
> redirect, only the new URL is saved. Is there any way to get the
> original URL that started the crawl of the redirect?
>
> Thanks, Chris
>
