If you are referring to links like

  href="javascript:__doPostBack('lnkGoa','')" <http://districts.nic.in/>

then these types of links cannot be processed; they are discarded by the URL
normalizer and filter. In fact, Nutch does not run JavaScript on fetched
content, so it cannot invoke the JavaScript ASP function __doPostBack. You
will need to live with it.

Not sure if this idea has been discussed earlier, but it would be
interesting to have a way to run JavaScript on fetched content, emulating a
browser in some way...

Best Regards
Alexander Aristov

On 1 July 2012 22:14, arijit <[email protected]> wrote:
> Hi,
> I am trying to crawl the url: http://districts.nic.in. The javascript
> links contain the meat of all information in this website. However, on
> crawling, nutch ignores all these href="javascript:.... links.
> I have ensured the following:
> nutch-site.xml contains parse-js in plugin.includes.
> parse-plugin.xml contains mimeType "application/x-javascript" is handled
> by plugin-id="parse-js".
> regex-urlfilter.txt does not ignore js|JS - however, not sure this would
> have resulted in ignoring of the href="javascript.. part of the website.
>
> Even forcing the web-site to be parsed as "application/x-javascript" by
> the following command:
> ./nutch parseChecker -forceAs application/x-javascript "
> http://districts.nic.in" does not result in the mentioned hrefs being
> picked up as outlinks.
>
> Any help in this regard, is much appreciated.
> -Arijit
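For what it's worth, here is a minimal Python sketch of why a javascript: pseudo-URL gets dropped. This is not Nutch code; `is_fetchable` is just an illustrative stand-in for what the URL normalizer/filter decides. The point is that `javascript:__doPostBack(...)` has no http(s) scheme and no host, so there is nothing for a fetcher to request.

```python
from urllib.parse import urlparse

def is_fetchable(href: str) -> bool:
    """Rough stand-in for a crawler's URL filter: keep only links
    with an http(s) scheme and an actual network location (host)."""
    parts = urlparse(href)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

for href in ("http://districts.nic.in/",
             "javascript:__doPostBack('lnkGoa','')"):
    # The javascript: link parses with scheme "javascript" and an
    # empty netloc, so it is discarded rather than queued for fetch.
    print(href, "->", "keep" if is_fetchable(href) else "discard")
```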

