Any help with this would be greatly appreciated.

Thanks,
Vijay

On Oct 6, 2014, at 10:41 AM, Vijay Chakilam <[email protected]> wrote:

> Hi,
> 
> I am trying to crawl a set of web pages, many of which redirect to other
> pages. I’ve set the max redirect setting to 5, and I am able to fetch the
> redirected pages, parse their content, and extract text and data. However,
> when I dump the data with the segment reader, I cannot link the original URL
> to the redirected page that is actually fetched.
> 
> For example, here’s one of the webpages I am trying to fetch: 
> http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au
> 
> The final redirected page that is fetched in this case is: 
> http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_a&mode=premium&dest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88&nk=9c8dd2e0c0c2f2ee9809449e54bd040b&memtype=anonymous
> 
> I am attaching the segment reader dumps for generate, fetch, parse, parsedata
> and parsetext. I am not sure how to link the original URL:
> http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au
> with the final redirected page that is actually fetched and parsed:
> http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_a&mode=premium&dest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88&nk=9c8dd2e0c0c2f2ee9809449e54bd040b&memtype=anonymous
> 
> <dump.test.fetch>
> <dump.test.generate>
> <dump.test.parse>
> <dump.test.parsedata>
> <dump.test.parsetext>
> 
> Thanks,
> Vijay
