Reading a URL from the DB returns the HTTP response stored for that URL: some header information and the body. Crawling a URL that answers with an HTTP redirect does not store the response of the redirection target under the redirecting URL.
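The redirect target is still visible in the stored headers: each redirecting record in the dump quoted below has a `Location=` entry in its metadata and an empty body. As a rough sketch only, assuming the dump keeps the `Recno::` / `URL::` / `Location=` layout shown in the quoted message (the function name `redirect_targets` and the shortened sample text are made up for illustration), the targets could be collected like this and then added to the seed list:

```python
import re

def redirect_targets(dump_text):
    """Map each dumped URL to the Location value found in its record.

    Assumes records start with 'Recno::' and contain a 'URL::' line,
    as in the readseg dump quoted in this thread.
    """
    targets = {}
    # Everything before the first 'Recno::' marker is not a record.
    for record in dump_text.split("Recno::")[1:]:
        url = re.search(r"URL::\s*(\S+)", record)
        # \s* also skips a line break inserted after 'Location='.
        location = re.search(r"Location=\s*(\S+)", record)
        if url and location:
            targets[url.group(1)] = location.group(1)
    return targets

# Shortened sample based on the first record of the quoted dump.
sample = """
Recno:: 0
URL:: http://is.gd/0yKjO6

Content::
Version: -1
url: http://is.gd/0yKjO6
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT Location= http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36
Content:
"""

for short_url, target in redirect_targets(sample).items():
    print(short_url, "->", target)
```

It may also be worth checking the `http.redirect.max` property described in `nutch-default.xml`, which, as far as I know, controls whether the fetcher follows a redirect immediately or only records the target for a later round. Treat that as a pointer to verify against your Nutch version, not as a confirmed fix.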
> Hi,
>
> My application needs to crawl a set of urls which I give to the urls
> directory and fetch only the contents of that urls only.
> I am not interested in the contents of the internal or external links.
> So I have run the crawl command by giving depth as 1.
>
> bin/nutch crawl urls -dir crawl -depth 1
>
> Nutch crawls the urls and gives me the contents of the given urls.
>
> I am reading the content using readseg utility.
>
> bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata
>
> With this I am fetching the content of webpage.
>
> The problem I am facing is if I give direct urls like
>
> http://isoc.org/wp/worldipv6day/
> http://openhackindia.eventbrite.com
> http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
> http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
> http://bangalore.yahoo.com/labs/summerschool.html
> http://riadevcamp.eventbrite.com
> http://www.sleepingtime.org/
>
> then I am able to get the contents of the webpage.
> But when I give the set of urls as short urls like
>
> http://is.gd/jOoAa9
> http://is.gd/ubHRAF
> http://is.gd/GiFqj9
> http://is.gd/H5rUhg
> http://is.gd/wvKINL
> http://is.gd/K6jTNl
> http://is.gd/mpa6fr
> http://is.gd/fmobvj
> http://is.gd/s7uZfr
>
> I am not able to fetch the contents.
>
> When I read the segments, it is not showing any content. Please find below
> the content of dump file read from segments.
>
> Recno:: 0
> URL:: http://is.gd/0yKjO6
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Jan 25 20:56:07 IST 2011
> Modified time: Thu Jan 01 05:30:00 IST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1295969171407
>
> Content::
> Version: -1
> url: http://is.gd/0yKjO6
> base: http://is.gd/0yKjO6
> contentType: text/html
> metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0
> Location= http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1
> _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html;
> charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
> Content:
>
>
> Recno:: 1
> URL:: http://is.gd/1tpKaN
>
> Content::
> Version: -1
> url: http://is.gd/1tpKaN
> base: http://is.gd/1tpKaN
> contentType: text/html
> metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0
> Location=
> http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1
> _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html;
> charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
> Content:
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Jan 25 20:56:07 IST 2011
> Modified time: Thu Jan 01 05:30:00 IST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
>
>
> I have also tried by setting the max.redirects property in
> nutch-default.xml as 4 but dint find any progress.
> Kindly provide me a solution for this problem.
>
> Thanks and regards,
> Ch. Arjun Kumar Reddy