I have a similar problem, and I'm planning to modify the parsing code... I hope it works.
On Mon, Jul 2, 2012 at 2:10 PM, Alexander Aristov <[email protected]> wrote:

> If you are referring to links like
> href="javascript:__doPostBack('lnkGoa','')" on <http://districts.nic.in/>,
> then these types of links cannot be processed; they get discarded by the
> URL normalizer and filter. In fact, Nutch doesn't run JavaScript on
> fetched content, so it cannot invoke the JavaScript ASP function
> __doPostBack.
>
> You need to live with it.
>
> Not sure if this idea has been discussed earlier, but it would be
> interesting to have a way to run JavaScript on fetched content, i.e.
> emulate a browser in some way.
>
> Best Regards
> Alexander Aristov
>
> On 1 July 2012 22:14, arijit <[email protected]> wrote:
>
> > Hi,
> > I am trying to crawl the URL http://districts.nic.in. The JavaScript
> > links contain the meat of all information on this website. However, on
> > crawling, Nutch ignores all these href="javascript:..." links.
> > I have ensured the following:
> > - nutch-site.xml contains parse-js in plugin.includes.
> > - parse-plugins.xml maps the mimeType "application/x-javascript" to
> >   plugin-id="parse-js".
> > - regex-urlfilter.txt does not ignore js|JS; however, I'm not sure this
> >   would have caused the href="javascript:..." parts of the website to be
> >   ignored.
> >
> > Even forcing the website to be parsed as "application/x-javascript" with
> > the following command:
> > ./nutch parseChecker -forceAs application/x-javascript "http://districts.nic.in"
> > does not result in the mentioned hrefs being picked up as outlinks.
> >
> > Any help in this regard is much appreciated.
> > -Arijit
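For anyone attempting the same modification: one possible starting point is to scan the fetched HTML for __doPostBack targets with a regex before the URL filters discard the javascript: hrefs. This is only a sketch under assumptions, not Nutch's actual parse-js code; the class and method names below are hypothetical and would need to be wired into a parser or parse filter plugin. Note also that extracting the target is only half the job: following such a link requires replaying an ASP.NET form POST (__EVENTTARGET, __VIEWSTATE, etc.), which Nutch's fetcher does not do out of the box.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls __doPostBack control IDs out of raw HTML.
// Not part of Nutch; shown only to illustrate the regex approach.
public class PostBackLinkExtractor {

    // Matches href="javascript:__doPostBack('target','argument')"
    // and captures the first argument (the event target control ID).
    private static final Pattern POSTBACK = Pattern.compile(
            "href=\"javascript:__doPostBack\\('([^']*)','([^']*)'\\)");

    public static List<String> extractTargets(String html) {
        List<String> targets = new ArrayList<>();
        Matcher m = POSTBACK.matcher(html);
        while (m.find()) {
            targets.add(m.group(1)); // e.g. "lnkGoa"
        }
        return targets;
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:__doPostBack('lnkGoa','')\">Goa</a>";
        System.out.println(extractTargets(html)); // [lnkGoa]
    }
}
```

Each extracted target would then have to be turned into a synthetic outlink (or a queued POST request) by whatever plugin hosts this logic; the regex alone only discovers the control IDs.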

