javascript in href does not get into outlink

arijit Sun, 01 Jul 2012 14:05:55 -0700

Hi,
   I am trying to crawl http://districts.nic.in. Most of the information on 
this site is reachable only through javascript links. However, when crawling, 
Nutch ignores all of these href="javascript:..." links.
   I have checked the following:
- nutch-site.xml includes parse-js in plugin.includes.
- parse-plugins.xml maps the mimeType "application/x-javascript" to 
  plugin-id="parse-js".
- regex-urlfilter.txt does not exclude js|JS. I am not sure, though, whether 
  that filter would have caused the href="javascript:..." links to be ignored 
  in the first place.
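
As a rough sketch, the two settings above would look something like the 
following. Everything in the plugin list other than parse-js is assumed for 
illustration and is not taken from the actual configuration.

```xml
<!-- conf/nutch-site.xml: parse-js added to the included plugins.
     Every plugin other than parse-js is an assumed placeholder. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- conf/parse-plugins.xml: route the JavaScript MIME type to parse-js. -->
<mimeType name="application/x-javascript">
  <plugin id="parse-js" />
</mimeType>
```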


   Even forcing the site to be parsed as "application/x-javascript" with the 
following command does not result in these hrefs being picked up as outlinks:
./nutch parseChecker -forceAs application/x-javascript "http://districts.nic.in"

   Any help in this regard is much appreciated.
-Arijit
