-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey,

this particular rule is commented out in my regex-normalize.xml (in
yours too, btw). I think it would only match index or default pages,
as you said. That is not the case in my example anyway.
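
To double-check that, here is a minimal sketch of the rule in Python
(not Nutch's own code). It assumes that Java's "(?i)" inside a group can
be written as a scoped "(?i:...)" group, and that "&amp;" in the XML
stands for a plain "&". The name normalize and the example.com URLs are
made up for illustration.

```python
import re

# Python translation of the commented-out rule from regex-normalize.xml.
# Assumption: Java's "(?i)" inside a group becomes a scoped "(?i:...)" group,
# and the XML entity "&amp;" is unescaped to "&".
DEFAULT_PAGE = re.compile(
    r"/((?i:index|default))\."
    r"((?i:js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?))"
    r"(\?|&|#|$)"
)


def normalize(url: str) -> str:
    """Replace every match with '/' plus group 3, like the substitution '/$3'."""
    return DEFAULT_PAGE.sub(r"/\3", url)


# Index/default pages are collapsed to their directory.
print(normalize("http://example.com/dir/index.html"))
# A page with any other name is left untouched.
print(normalize("http://example.com/dir/page.jsp?param=value"))
```

With this translation, only URLs whose last path segment starts with
index or default are rewritten, so the rule would not explain
'/dir/?param=value' even if it were active.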

I guess it has something to do with the link grabber of Nutch, which
misinterprets some special cases?
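
For comparison, this is how a resolver that follows RFC 3986 handles a
query-only reference. It is only a sketch using Python's standard
library, not Nutch's link extraction, and example.com is a made-up
host.

```python
from urllib.parse import urljoin

# Hypothetical base page; only the path '/dir/page.jsp' comes from the thread.
base = "http://example.com/dir/page.jsp"

# A reference consisting only of a query keeps the base path
# and replaces just the query component.
resolved = urljoin(base, "?param=value")
print(resolved)  # http://example.com/dir/page.jsp?param=value
```

This matches what the browsers do, which suggests the difference comes
from how the crawler resolves the relative link rather than from the
page itself.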

Cheers

On 28.06.11 12:32, Marek Bachmann wrote:
> I really don't know if I am right, but in my opinion that could
> happen because of the substitution:
>
> <!-- changes default pages into standard for /index.html, etc. into /
> <regex>
>   <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
>   <substitution>/$3</substitution>
> </regex> -->
>
> which is in the regex-normalize.xml file by default.
>
> But, as I understand it, this should only happen with page names
> that begin with index or default.
>
> Hope this was right =)
>
> Cheers
>
> On 28.06.2011 11:49, Matthias Naber wrote:
>>
> Hey all,
>
> has anyone tried crawling pages with URL parameters? I got stuck on
> a page (let's call it '/dir/page.jsp') that contains a link like
>
> <a href="?param=value">my link text</a>
>
> In the browser everything works fine. Clicking the link opens
> the URL '/dir/page.jsp?param=value'.
>
> But the Nutch crawler interprets this link differently.
> Nutch's result looks like '/dir/?param=value'. So it tries to
> open the href target in the current directory instead of appending
> it to the current page, as a browser would do.
>
> So the question is: who is wrong, all the browsers or the Nutch
> crawler/link interpreter? :)
>
> Cheers, mana
>>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk4Jsn8ACgkQzp84az+gLK0kcwCeMrFoiWmh2kqEOPIQF76qeu8q
hqcAn0Lhw6Hpo7UaaEQruqL/4DhWgDBd
=crtF
-----END PGP SIGNATURE-----
