JSParseFilter produces weird URL
---------------------------------

                 Key: NUTCH-807
                 URL: https://issues.apache.org/jira/browse/NUTCH-807
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0.0
         Environment: Redhat 2.6.18-128.1.6.el5PAE i686 i686 i386 GNU/Linux
            Reporter: Minyao Zhu

This was found when crawling the site http://zhidao.baidu.com/ (a Chinese-language site).

It appears this page contains JavaScript that confuses JSParseFilter, which produces URLs like this:

http://zhidao.baidu.com/){if(A===46){baidu.hide(

I am not sure of the impact or scope of this issue in general. The observation for this specific site is that far fewer pages get crawled.

Thanks.
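
For illustration only, here is a minimal Python sketch (not Nutch code, and not a description of how JSParseFilter is implemented) of how a stray JavaScript fragment, once treated as a relative link and resolved against the page URL, yields exactly the URL reported above. The helper `looks_like_code` and its character set are hypothetical: a crude heuristic for spotting such fragments, not a proposed fix. Parentheses can legitimately appear in URLs, so a check like this would reject some valid links.

```python
from urllib.parse import urljoin

BASE_URL = "http://zhidao.baidu.com/"

# Characters that are common in JavaScript source but rare in link targets.
# Hypothetical heuristic: this set is an assumption, chosen for illustration.
SUSPICIOUS_CHARS = set("{}();")


def resolve(candidate: str, base: str = BASE_URL) -> str:
    """Resolve a candidate link string against the page's base URL."""
    return urljoin(base, candidate)


def looks_like_code(candidate: str) -> bool:
    """Return True if the candidate contains characters typical of JS code."""
    return any(ch in SUSPICIOUS_CHARS for ch in candidate)


if __name__ == "__main__":
    # A JavaScript fragment mistaken for a relative URL.
    fragment = "){if(A===46){baidu.hide("
    print(resolve(fragment))
    print(looks_like_code(fragment))
    print(looks_like_code("question/12345.html"))
```

Resolving the fragment reproduces the malformed URL from this report, which suggests the extracted string was plain script text rather than a link.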