I have a similar problem, and I'm planning to modify the parsing code myself. I hope it works.
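In case it helps, a first cut at "modifying the parsing code" might look like the sketch below: a small, self-contained Java class (the class name, method, and regex are my own illustration, not part of Nutch) that pulls the postback target out of href="javascript:__doPostBack(...)" attributes in fetched HTML. Note that extracting the target alone is not enough to actually crawl those pages: the real navigation is an ASP.NET form POST carrying __VIEWSTATE and __EVENTTARGET fields, which Nutch does not replay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoPostBackExtractor {
    // Matches href="javascript:__doPostBack('target','argument')" and captures
    // the postback target (e.g. "lnkGoa"). Real pages vary in quoting style,
    // so a production version would need a more tolerant pattern.
    private static final Pattern POSTBACK = Pattern.compile(
            "href=[\"']javascript:__doPostBack\\('([^']*)','([^']*)'\\)[\"']");

    /** Returns the __doPostBack targets found in the given HTML snippet. */
    public static List<String> extractTargets(String html) {
        List<String> targets = new ArrayList<>();
        Matcher m = POSTBACK.matcher(html);
        while (m.find()) {
            targets.add(m.group(1));
        }
        return targets;
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:__doPostBack('lnkGoa','')\">Goa</a>";
        System.out.println(extractTargets(html)); // prints [lnkGoa]
    }
}
```

Hooking something like this into Nutch would mean wiring it into an HtmlParseFilter plugin so the extracted targets can be turned into outlinks, which is a larger job than the extraction itself.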

On Mon, Jul 2, 2012 at 2:10 PM, Alexander Aristov <
[email protected]> wrote:

> If you are referring to links such as
> href="javascript:__doPostBack('lnkGoa','')", then these links cannot be
> processed: they get discarded by the URL normalizer and filter. In fact,
> Nutch does not run JavaScript on fetched content, so it cannot invoke the
> JavaScript ASP.NET function __doPostBack.
>
> You will have to live with it.
>
>
> Not sure if this idea has been discussed earlier, but it would be
> interesting to have a way to run JavaScript on fetched content, emulating
> a browser in some way.
>
> Best Regards
> Alexander Aristov
>
>
> On 1 July 2012 22:14, arijit <[email protected]> wrote:
>
> > Hi,
> >    I am trying to crawl the URL http://districts.nic.in. The JavaScript
> > links contain the meat of the information on this website. However, when
> > crawling, Nutch ignores all of these href="javascript:..." links.
> >    I have ensured the following:
> > nutch-site.xml includes parse-js in plugin.includes.
> > parse-plugins.xml maps the mimeType "application/x-javascript" to
> > plugin-id="parse-js".
> > regex-urlfilter.txt does not exclude js|JS (though I am not sure this
> > would have caused the href="javascript:..." links to be ignored).
> >
> >    Even forcing the website to be parsed as "application/x-javascript"
> > with the following command:
> > ./nutch parseChecker -forceAs application/x-javascript "http://districts.nic.in"
> > does not result in the mentioned hrefs being picked up as outlinks.
> >
> >    Any help in this regard is much appreciated.
> > -Arijit
> >
>
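For anyone comparing configurations: the checks arijit lists above would correspond roughly to fragments like the following. These are illustrative only; the exact plugin.includes value varies by Nutch version and by which other plugins you need, so treat the regex below as a placeholder, not a recommended setting.

```xml
<!-- conf/nutch-site.xml (fragment): make sure parse-js is in plugin.includes -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|js)|index-basic</value>
</property>

<!-- conf/parse-plugins.xml (fragment): map the JavaScript MIME type to parse-js -->
<mimeType name="application/x-javascript">
  <plugin id="parse-js" />
</mimeType>
```

Even with both in place, as Alexander notes, parse-js only parses JavaScript source for link-like strings; it does not execute the script, so __doPostBack links still produce no outlinks.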
