Link original URL with the final redirected URL

Hi,

I am trying to crawl a set of web pages, many of which redirect to other pages. I've set the max redirect setting to 5, and I am able to fetch the redirected pages, parse their content, and extract text and data. However, when I dump the data with the segment reader, I cannot link the original URL to the redirect target that was actually fetched.

For example, here’s one of the webpages I am trying to fetch: http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au

The final redirected page that is fetched in this case is: http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_a&mode=premium&dest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88&nk=9c8dd2e0c0c2f2ee9809449e54bd040b&memtype=anonymous

I am attaching the segment reader dumps for generate, fetch, parse, parsedata and parsetext. I am not sure how to link the original URL (http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au) with the final redirected page that is actually fetched and parsed (http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_a&mode=premium&dest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88&nk=9c8dd2e0c0c2f2ee9809449e54bd040b&memtype=anonymous).
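
To make the goal concrete, here is a rough sketch of the mapping I am after. It is not the crawler's API; it assumes I could somehow extract one "from URL -> to URL" record per redirect hop from the dumps, which is exactly the part I don't know how to do. The names `resolve_final_url`, `link_originals` and the `redirects` dictionary are hypothetical, and the example.com URLs are placeholders.

```python
from typing import Dict, Iterable, Optional

MAX_REDIRECTS = 5  # mirrors the max redirect setting mentioned above


def resolve_final_url(original: str,
                      redirects: Dict[str, str],
                      max_redirects: int = MAX_REDIRECTS) -> Optional[str]:
    """Follow recorded redirect hops starting at `original`.

    `redirects` is a hypothetical map of one hop each (from URL -> to URL).
    Returns the last URL in the chain, or None if the chain loops or
    needs more than `max_redirects` hops.
    """
    seen = {original}
    current = original
    hops = 0
    while current in redirects:
        if hops >= max_redirects:
            return None  # chain is longer than the allowed number of hops
        current = redirects[current]
        hops += 1
        if current in seen:
            return None  # redirect loop
        seen.add(current)
    return current


def link_originals(originals: Iterable[str],
                   redirects: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Build the original URL -> final URL table for a set of seed URLs."""
    return {url: resolve_final_url(url, redirects) for url in originals}


if __name__ == "__main__":
    # Placeholder hop records; real ones would have to come from the crawl data.
    hops = {
        "http://example.com/link/1": "http://example.com/step",
        "http://example.com/step": "http://example.com/final.html",
    }
    for orig, final in link_originals(["http://example.com/link/1"], hops).items():
        print(orig, "->", final)
```

If something like the `redirects` table can be recovered from the segment data, the rest is just a chain walk capped at the same limit as the fetcher.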

Attachments:
dump.test.fetch
dump.test.generate
dump.test.parse
dump.test.parsedata
dump.test.parsetext

Thanks,

Vijay
