Re: Regarding crawling of short URL's

Markus Jelsma Tue, 25 Jan 2011 17:57:40 -0800

Reading a URL from the DB returns the HTTP response stored for that URL: some 
header information and the body.  Crawling a URL that answers with an HTTP 
redirect does not store the response of the redirect target under the 
redirecting URL.
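
A rough sketch of that behaviour, in Python rather than Nutch code: the
record for a short URL holds only the redirect response (a Location header
and an empty body), and the target page exists only under its own URL. The
names resolve_redirects and fake_fetch, the 301 status, and the placeholder
target body are assumptions made for illustration; only the two URLs come
from the dump quoted below.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin

# (status code, headers, body) -- a simplified stand-in for an HTTP response.
Response = Tuple[int, Dict[str, str], str]

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def resolve_redirects(fetch: Callable[[str], Response], url: str,
                      max_redirects: int = 4) -> Tuple[str, int, str]:
    """Follow Location headers until a non-redirect response is found.

    Returns (final_url, status, body). Raises RuntimeError when more than
    max_redirects redirects are encountered.
    """
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        location = headers.get("Location")
        if status not in REDIRECT_STATUSES or not location:
            return url, status, body
        # Location may be relative; resolve it against the current URL.
        url = urljoin(url, location.strip())
    raise RuntimeError("more than %d redirects" % max_redirects)


# Hypothetical in-memory "web" so the sketch runs without network access.
# The 301 status and the target body are assumed, not taken from the dump.
FAKE_WEB: Dict[str, Response] = {
    "http://is.gd/0yKjO6": (
        301,
        {"Location": " http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1"},
        "",  # a redirect response carries no page content
    ),
    "http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1": (
        200,
        {"Content-Type": "text/html; charset=UTF-8"},
        "<html>placeholder target page</html>",
    ),
}


def fake_fetch(url: str) -> Response:
    return FAKE_WEB[url]


final_url, status, body = resolve_redirects(fake_fetch, "http://is.gd/0yKjO6")
print(final_url, status, len(body))
```

Fetching the short URL alone returns the empty body seen in the segment
dump; the content only appears once the Location target is fetched as a
URL in its own right.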


> Hi,
> 
> My application needs to crawl a set of URLs that I put in the urls
> directory, and fetch the contents of those URLs only.
> I am not interested in the contents of their internal or external links,
> so I ran the crawl command with a depth of 1.
> 
> bin/nutch crawl urls -dir crawl -depth 1
> 
> Nutch crawls the urls and gives me the contents of the given urls.
> 
> I am reading the content using readseg utility.
> 
> bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch
> -nogenerate -noparse -noparsedata
> 
> With this I am fetching the content of the web pages.
> 
> The problem I am facing is if I give direct urls like
> 
> http://isoc.org/wp/worldipv6day/
> http://openhackindia.eventbrite.com
> http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
> http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
> http://bangalore.yahoo.com/labs/summerschool.html
> http://riadevcamp.eventbrite.com
> http://www.sleepingtime.org/
> 
> then I am able to get the contents of the webpage.
> But when I give the set of urls as short urls like
> 
> http://is.gd/jOoAa9
> http://is.gd/ubHRAF
> http://is.gd/GiFqj9
> http://is.gd/H5rUhg
> http://is.gd/wvKINL
> http://is.gd/K6jTNl
> http://is.gd/mpa6fr
> http://is.gd/fmobvj
> http://is.gd/s7uZfr
> 
> I am not able to fetch the contents.
> 
> When I read the segments, it is not showing any content. Please find below
> the content of dump file read from segments.
> 
> Recno:: 0
> URL:: http://is.gd/0yKjO6
> 
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Jan 25 20:56:07 IST 2011
> Modified time: Thu Jan 01 05:30:00 IST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1295969171407
> 
> Content::
> Version: -1
> url: http://is.gd/0yKjO6
> base: http://is.gd/0yKjO6
> contentType: text/html
> metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0
> Location= http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1
> _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html;
> charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
> Content:
> 
> 
> Recno:: 1
> URL:: http://is.gd/1tpKaN
> 
> Content::
> Version: -1
> url: http://is.gd/1tpKaN
> base: http://is.gd/1tpKaN
> contentType: text/html
> metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0
> Location=
> http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1
> _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html;
> charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
> Content:
> 
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Jan 25 20:56:07 IST 2011
> Modified time: Thu Jan 01 05:30:00 IST 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> 
> 
> I have also tried setting the max.redirects property in
> nutch-default.xml to 4, but it made no difference.
> Kindly suggest a solution to this problem.
> 
> Thanks and regards,
> Ch. Arjun Kumar Reddy
