If you are referring to links like

  href="javascript:__doPostBack('lnkGoa','')" <http://districts.nic.in/>

then these types of links cannot be processed; they are discarded by the URL
normalizer and filter. In fact, Nutch does not run JavaScript on fetched
content, so it cannot invoke the JavaScript ASP function __doPostBack. You
will need to live with it.

Not sure if this idea has been discussed earlier, but it would be
interesting to have a way to run JavaScript on fetched content, emulating a
browser in some way...

Best Regards
Alexander Aristov

On 1 July 2012 22:14, arijit <[email protected]> wrote:
> Hi,
> I am trying to crawl the url: http://districts.nic.in. The javascript
> links contain the meat of all information in this website. However, on
> crawling, nutch ignores all these href="javascript:.... links.
> I have ensured the following:
> nutch-site.xml contains parse-js in plugin.includes.
> parse-plugin.xml contains mimeType "application/x-javascript" is handled
> by plugin-id="parse-js".
> regex-urlfilter.txt does not ignore js|JS - however, not sure this would
> have resulted in ignoring of the href="javascript.. part of the website.
>
> Even forcing the web-site to be parsed as "application/x-javascript" by
> the following command:
> ./nutch parseChecker -forceAs application/x-javascript "
> http://districts.nic.in" does not result in the mentioned hrefs being
> picked up as outlinks.
>
> Any help in this regard, is much appreciated.
> -Arijit
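For what it's worth, here is a minimal Python sketch of why a javascript: pseudo-URL gets dropped. This is not Nutch code; `is_fetchable` is just an illustrative stand-in for what the URL normalizer/filter decides. The point is that `javascript:__doPostBack(...)` has no http(s) scheme and no host, so there is nothing for a fetcher to request.

```python
from urllib.parse import urlparse

def is_fetchable(href: str) -> bool:
    """Rough stand-in for a crawler's URL filter: keep only links
    with an http(s) scheme and an actual network location (host)."""
    parts = urlparse(href)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

for href in ("http://districts.nic.in/",
             "javascript:__doPostBack('lnkGoa','')"):
    # The javascript: link parses with scheme "javascript" and an
    # empty netloc, so it is discarded rather than queued for fetch.
    print(href, "->", "keep" if is_fetchable(href) else "discard")
```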

