Thanks a lot. That will be of quite some help.
-Arijit


________________________________
 From: remi tassing <[email protected]>
To: [email protected] 
Cc: arijit <[email protected]> 
Sent: Tuesday, July 3, 2012 1:56 PM
Subject: Re: javascript in href does not get into outlink
 

I have a similar problem and I'm planning to modify the parsing code. I hope
it works.


On Mon, Jul 2, 2012 at 2:10 PM, Alexander Aristov <[email protected]> 
wrote:

If you are referring to links like these:
>href="javascript:__doPostBack('lnkGoa','')"
>then these types of links cannot be processed and are discarded by the URL
>normalizer and filter. In fact, Nutch doesn't run JavaScript on fetched
>content, so it cannot invoke the ASP JavaScript function __doPostBack.
>
>You need to live with it.
>
>
>Not sure if this idea has been discussed earlier, but it would be
>interesting to have a way to run JavaScript on fetched content, emulating
>a browser in some way...
>
>Best Regards
>Alexander Aristov
>
>
>
>On 1 July 2012 22:14, arijit <[email protected]> wrote:
>
>> Hi,
>>    I am trying to crawl the url: http://districts.nic.in. The javascript
>> links contain the meat of all information in this website. However, on
>> crawling, nutch ignores all these href="javascript:.... links.
>>    I have ensured the following:
>> nutch-site.xml contains parse-js in plugin.includes.
>> parse-plugins.xml maps mimeType "application/x-javascript" to
>> plugin-id="parse-js"
>> regex-urlfilter.txt does not ignore js|JS - however, I am not sure this
>> would have caused the href="javascript.. part of the website to be ignored.
>>
>>    Even forcing the web-site to be parsed as "application/x-javascript" with
>> the following command:
>> ./nutch parseChecker -forceAs application/x-javascript "http://districts.nic.in"
>> does not result in the mentioned hrefs being picked up as outlinks.
>>
>>    Any help in this regard is much appreciated.
>> -Arijit
>>
>
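
For anyone attempting the parser change mentioned in the thread, here is a minimal sketch of a possible first step: pulling the postback target and argument out of a javascript: href. It is written in Python for brevity rather than in Nutch's Java, and the names POSTBACK_HREF and extract_postbacks are hypothetical, not part of Nutch. It assumes the arguments are single-quoted, as in the example link from the thread.

```python
import re

# Hypothetical pattern: matches href="javascript:__doPostBack('target','argument')".
# Assumes both arguments are single-quoted string literals, as in the thread's example.
POSTBACK_HREF = re.compile(
    r"""href\s*=\s*["']javascript:\s*__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)""",
    re.IGNORECASE,
)


def extract_postbacks(html):
    """Return (event_target, event_argument) pairs found in javascript: hrefs."""
    return [(m.group(1), m.group(2)) for m in POSTBACK_HREF.finditer(html)]


sample = '''<a href="javascript:__doPostBack('lnkGoa','')">Goa</a>'''
print(extract_postbacks(sample))
```

This only recovers the arguments. Following an ASP.NET postback means re-submitting the page's form via POST with the hidden fields __EVENTTARGET, __EVENTARGUMENT and __VIEWSTATE, which a plain outlink URL cannot express, so a crawler would need additional logic beyond link extraction.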
