Hi,

I am trying to crawl http://districts.nic.in. Most of the information on this site sits behind JavaScript links, but when crawling, Nutch ignores all of the href="javascript:..." links.

I have checked the following:

- nutch-site.xml includes parse-js in plugin.includes.
- parse-plugins.xml maps the mimeType "application/x-javascript" to plugin-id="parse-js".
- regex-urlfilter.txt does not exclude js|JS. That said, I am not sure this filter would have caused the href="javascript:..." links to be ignored in the first place.
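
For reference, a minimal sketch of the two settings described above. The plugin.includes value is illustrative (a typical Nutch 1.x plugin list with js added to the parse group) and is an assumption, not a copy of the actual configuration:

```xml
<!-- conf/nutch-site.xml: enable parse-js alongside the other parsers.
     The plugin list below is an assumed example, not the real one. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- conf/parse-plugins.xml: route JavaScript content to parse-js. -->
<mimeType name="application/x-javascript">
  <plugin id="parse-js" />
</mimeType>
```

Note that this mapping only takes effect for content whose detected type is application/x-javascript; the front page itself is served as HTML.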

Even forcing the page to be parsed as "application/x-javascript" with the following command does not result in these hrefs being picked up as outlinks:

./nutch parseChecker -forceAs application/x-javascript "http://districts.nic.in"

Any help in this regard is much appreciated.

-Arijit

