Hi,
I've been using Nutch 0.7 for some time and am now trying to migrate to
1.1. Comparing the results of the old and new versions, I've seen that some
URLs that used to be crawled are no longer there.
I've checked one of them, and it is the target of a JavaScript redirect that
is performed when the page is loaded. Looking into the segments, I've seen
that the old version found more outlinks than the new one. Some of those
links look wrong (they point to JS resources and things like that), but the
one that I'm now missing is part of an embedded JS function that contains the
redirect URL. Something like:
<script language="JavaScript1.2" type="text/javascript">
function redirect() {
window.location="/vacacional/seleccionaPaquete.do?npqpaquetes=16657&origen_listado=false";
}
</script>
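I don't know how parse-js actually finds these links. Purely to illustrate the kind of match I would expect to pick up the URL in the snippet above, here is a rough sketch. The function name and the regular expression are my own assumptions, not Nutch code:

```javascript
// Hypothetical helper, NOT part of Nutch: collect string literals that are
// assigned to window.location / location.href inside inline script text.
function extractRedirectTargets(scriptText) {
  // Optional "window." prefix, optional ".href", a single "=", then a
  // quoted string. The backreference \1 requires the same closing quote.
  const pattern = /(?:window\.)?location(?:\.href)?\s*=\s*(["'])(.*?)\1/g;
  const targets = [];
  let match;
  while ((match = pattern.exec(scriptText)) !== null) {
    targets.push(match[2]);
  }
  return targets;
}

// The body of the inline script from the page in question.
const sample =
  'function redirect() {\n' +
  'window.location="/vacacional/seleccionaPaquete.do?npqpaquetes=16657&origen_listado=false";\n' +
  '}';

console.log(extractRedirectTargets(sample));
```

A heuristic like this only sees literal assignments. It would miss URLs that are built by string concatenation, and it says nothing about whether redirect() is ever called.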
I've checked that the parse-js plugin is enabled. Apart from that, I don't
know what the problem could be. It looks like the crawling is now more
"polite": all the outlinks it finds are accurate, but it also misses some
valid ones.
Could it be a configuration parameter, or a change in the implementation of
the parse-js plugin?
Any help will be appreciated.