Hi,
   I am trying to crawl http://districts.nic.in. Most of the information on 
this website sits behind javascript links. However, when crawling, Nutch 
ignores all of these href="javascript:..." links.
   I have checked the following:
- nutch-site.xml includes parse-js in plugin.includes.
- parse-plugins.xml maps the mimeType "application/x-javascript" to 
  plugin-id="parse-js".
- regex-urlfilter.txt does not exclude js|JS. That said, I am not sure such 
  an exclusion would cause the href="javascript:..." links on the page to be 
  ignored in the first place.
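
For reference, a minimal sketch of the kind of entries meant above. The plugin list in the value is illustrative only and is not copied from the actual configuration; the mimeType entry follows the usual parse-plugins.xml layout.

```xml
<!-- nutch-site.xml: parse-js added to plugin.includes (illustrative plugin list) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|js)|index-basic</value>
</property>

<!-- parse-plugins.xml: map the JavaScript MIME type to the parse-js plugin -->
<mimeType name="application/x-javascript">
  <plugin id="parse-js" />
</mimeType>
```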

   Even forcing the site to be parsed as "application/x-javascript" with the 
following command does not result in these hrefs being picked up as outlinks:
./nutch parseChecker -forceAs application/x-javascript "http://districts.nic.in"

   Any help in this regard is much appreciated.
-Arijit
