Any help with this would be greatly appreciated.

Thanks,
Vijay

On Oct 6, 2014, at 10:41 AM, Vijay Chakilam <[email protected]> wrote:
> Hi,
>
> I am trying to crawl a bunch of webpages. Many of those redirect to some
> other pages. I’ve set the max redirect setting to 5 and was able to fetch the
> redirected pages and parse the content and extract text and data. When I use
> segment reader and dump data, I am not able to link the original url with the
> redirect page that is actually fetched.
>
> For example, here’s one of the webpages I am trying to fetch:
> http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au
>
> The final redirected page that is fetched in this case is:
> http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_a&mode=premium&dest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88&nk=9c8dd2e0c0c2f2ee9809449e54bd040b&memtype=anonymous
>
> I am attaching the segment reader dump for generate, fetch, parse, parsedata
> and parsetext. I am not sure how to link the original url:
> http://cdn.newsapi.com.au/link/6c35fe0e95b0fb34608eb90c9637f8f1?domain=theaustralian.com.au
> with the final redirect page that is actually fetched and parsed:
> http://www.theaustralian.com.au/subscribe/news/1/index.html?sourceCode=TAWEB_WRE170_a&mode=premium&dest=http:/www.theaustralian.com.au/business/opinion/beware-the-watchdogs-bark/story-e6frg9lo-1227077646475?sv=cb5aeda07ef5f9841662884c31232e88&nk=9c8dd2e0c0c2f2ee9809449e54bd040b&memtype=anonymous
>
> <dump.test.fetch>
> <dump.test.generate>
> <dump.test.parse>
> <dump.test.parsedata>
> <dump.test.parsetext>
>
> Thanks,
> Vijay

