Thanks a lot. That will be of quite some help.
-Arijit
________________________________
From: remi tassing <[email protected]>
To: [email protected]
Cc: arijit <[email protected]>
Sent: Tuesday, July 3, 2012 1:56 PM
Subject: Re: javascript in href does not get into outlink

I have a similar problem and I'm planning to modify the parsing code... I hope it works.

On Mon, Jul 2, 2012 at 2:10 PM, Alexander Aristov <[email protected]> wrote:

> If you are referring to links like
> href="javascript:__doPostBack('lnkGoa','')"
> then these types of links cannot be processed; they get discarded by the
> URL normalizer and filter. In fact, Nutch doesn't run JavaScript on fetched
> content, so it cannot invoke the JavaScript ASP function __doPostBack.
>
> You need to live with it.
>
> Not sure if this idea has been discussed earlier, but it would be
> interesting to have a way to run JavaScript on fetched content. Emulate a
> browser in some way...
>
> Best Regards
> Alexander Aristov
>
> On 1 July 2012 22:14, arijit <[email protected]> wrote:
>
>> Hi,
>> I am trying to crawl the URL http://districts.nic.in. The javascript
>> links contain the meat of all information in this website. However, on
>> crawling, Nutch ignores all these href="javascript:... links.
>> I have ensured the following:
>> nutch-site.xml contains parse-js in plugin.includes.
>> parse-plugins.xml specifies that mimeType "application/x-javascript" is
>> handled by plugin-id="parse-js".
>> regex-urlfilter.txt does not ignore js|JS - however, I am not sure this
>> would have caused the href="javascript:... part of the website to be
>> ignored anyway.
>>
>> Even forcing the website to be parsed as "application/x-javascript" with
>> the following command:
>> ./nutch parseChecker -forceAs application/x-javascript "http://districts.nic.in"
>> does not result in the mentioned hrefs being picked up as outlinks.
>>
>> Any help in this regard is much appreciated.
>> -Arijit
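For anyone landing on this thread later: a minimal sketch, assuming you only want to see which postback targets a page references, of pulling the two __doPostBack arguments out of href="javascript:..." values. It is written in Python rather than as a Nutch plugin, and the names POSTBACK_RE and extract_postbacks are hypothetical, not part of Nutch. It does not turn these links into fetchable URLs; it only recovers the arguments.

```python
import re

# Hypothetical helper, not part of Nutch.
# Matches href values such as: javascript:__doPostBack('lnkGoa','')
# and captures the two single-quoted arguments.
POSTBACK_RE = re.compile(
    r"javascript:\s*__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)",
    re.IGNORECASE,
)


def extract_postbacks(html):
    """Return a list of (first_arg, second_arg) tuples found in the HTML text."""
    return POSTBACK_RE.findall(html)


if __name__ == "__main__":
    sample = "<a href=\"javascript:__doPostBack('lnkGoa','')\">Goa</a>"
    print(extract_postbacks(sample))
```

Whether the recovered arguments can then be replayed against the site is a separate question that depends on how the page's form is submitted, which this sketch does not attempt.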

