Hi,

I am trying to crawl a set of web pages, many of which redirect to other pages. I have set the max redirect setting to 5, and the redirected pages are fetched and parsed correctly, with text and data extracted. However, when I dump the data with the segment reader, I am not able to link the original URL to the redirect target that was actually fetched.

For example, here’s one of the webpages I am trying to fetch: http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au
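
To make clear what I am after, here is a rough Python sketch of the mapping I would like to recover from the dumps. It is not crawler code and says nothing about how the segment data is actually stored: the function names, the `fetch` callback, and the example.com-style URLs are all hypothetical, chosen only for illustration. It follows at most 5 redirects per URL and records the chain from the original URL to the page finally fetched.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical fetcher: takes a URL and returns
# (HTTP status code, value of the Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_redirects(url: str, fetch: Fetch, max_redirects: int = 5) -> List[str]:
    """Return the chain of URLs visited, starting with the original URL."""
    chain = [url]
    for _ in range(max_redirects):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break  # reached a page that does not redirect
        # The Location header may be relative, so resolve it against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain


def link_original_to_final(
    urls: Iterable[str], fetch: Fetch, max_redirects: int = 5
) -> Dict[str, str]:
    """Map each original URL to the last URL in its redirect chain."""
    return {u: resolve_redirects(u, fetch, max_redirects)[-1] for u in urls}


# Made-up responses standing in for real HTTP requests (illustration only).
FAKE_RESPONSES: Dict[str, Tuple[int, Optional[str]]] = {
    "http://a.example/start": (302, "/mid"),
    "http://a.example/mid": (301, "http://b.example/final"),
    "http://b.example/final": (200, None),
}


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    """Look up a canned response; unknown URLs are treated as 404."""
    return FAKE_RESPONSES.get(url, (404, None))


if __name__ == "__main__":
    print(link_original_to_final(["http://a.example/start"], fake_fetch))
```

What I am missing in the segment dump is the equivalent of that returned dictionary: for each URL I injected, the URL of the page whose content was actually stored.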



Attachment: dump.test.fetch
Description: Binary data

Attachment: dump.test.generate
Description: Binary data

Attachment: dump.test.parse
Description: Binary data

Attachment: dump.test.parsedata
Description: Binary data

Attachment: dump.test.parsetext
Description: Binary data


Thanks,
Vijay