I am not sure I am right, but I think this could be caused by the following substitution rule:

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex> -->

which is present by default in the regex-normalize.xml file.

But, as far as I understand it, this should only affect page names that begin with index or default.
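
A quick way to check this is to apply the same pattern with plain java.util.regex. The following is only a minimal sketch: the class name, the helper and the example.com URLs are made up for illustration, it assumes the rule is actually active (in the snippet above it sits inside an XML comment), and it assumes `&amp;` reaches the regex engine as a plain `&` once the XML is parsed.

```java
import java.util.regex.Pattern;

// Hypothetical test class, not part of Nutch.
class IndexRuleCheck {
    // Pattern copied from the rule above, with the XML entity &amp; written as &.
    static final Pattern RULE = Pattern.compile(
        "/((?i)index|default)\\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\\?|&|#|$)");

    // Applies the substitution "/$3" to every match, as the rule describes.
    static String normalize(String url) {
        return RULE.matcher(url).replaceAll("/$3");
    }

    public static void main(String[] args) {
        // Page name is "page", so the rule does not match and the URL is unchanged.
        System.out.println(normalize("http://example.com/dir/page.jsp?param=value"));
        // Page name is "index", so "/index.jsp?" is rewritten to "/?".
        System.out.println(normalize("http://example.com/dir/index.jsp?param=value"));
    }
}
```

If that holds, the rule leaves '/dir/page.jsp?param=value' alone and only rewrites URLs such as '/dir/index.jsp?param=value'.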

Hope this was right =)

Cheers

On 28.06.2011 11:49, Matthias Naber wrote:


Hey all,

has anyone tried crawling pages with URL parameters? I got stuck on a
page (let's call it '/dir/page.jsp') that contains a link like

<a href="?param=value">my link text</a>

In the browser everything works fine. Clicking the link opens the
URL '/dir/page.jsp?param=value'.

But the Nutch crawler interprets this link differently. Nutch's
result is '/dir/?param=value'. So it tries to resolve the href
target against the current directory instead of appending it to the
current page, as a browser would do.
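
For reference, RFC 3986 (section 5.2.2) says that a reference with an empty path keeps the base path and only replaces the query, which matches what the browsers do here. Below is a minimal sketch of just that one case. The class and method names are hypothetical, example.com is a placeholder host, and this is not how Nutch resolves links.

```java
// Hypothetical helper class, not Nutch code.
class QueryOnlyResolve {
    // Resolves a query-only reference such as "?param=value" against a base URL:
    // the base keeps its scheme, host and path, and loses its own query and fragment.
    static String resolveQueryOnly(String base, String ref) {
        if (!ref.startsWith("?")) {
            throw new IllegalArgumentException("only query-only references are handled");
        }
        int cut = base.length();
        int hash = base.indexOf('#');
        if (hash >= 0) {
            cut = hash;               // drop the base fragment
        }
        int query = base.indexOf('?');
        if (query >= 0 && query < cut) {
            cut = query;              // drop the base query
        }
        return base.substring(0, cut) + ref;
    }

    public static void main(String[] args) {
        // Prints http://example.com/dir/page.jsp?param=value
        System.out.println(resolveQueryOnly("http://example.com/dir/page.jsp", "?param=value"));
    }
}
```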

So the question is: who is wrong, all the browsers or the Nutch
crawler/link interpreter? :)

Cheers,
mana

