I've seen trouble with relative URLs before. Perhaps we should improve our
unit tests to cover these cases.
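
As a starting point, here is a rough sketch of what such a test could assert.
It is written in Python rather than Nutch's Java, only because Python's
`urllib.parse.urljoin` follows the RFC 3986 reference resolution rules and so
makes the expected result easy to show. The host `example.com` and the helper
name `resolve` are placeholders of mine; the path and the href come from the
report quoted below.

```python
from urllib.parse import urljoin

# Placeholder base URL: the host is an assumption, the path is from the report.
BASE = "http://example.com/dir/page.jsp"


def resolve(base, href):
    """Resolve href against base following RFC 3986 reference resolution."""
    return urljoin(base, href)


# A query-only reference keeps the base path and replaces only the query.
print(resolve(BASE, "?param=value"))  # http://example.com/dir/page.jsp?param=value

# A relative path reference replaces the last path segment.
print(resolve(BASE, "other.jsp"))     # http://example.com/dir/other.jsp
```

If this is the behavior we want, a crawler-side test would expect
'/dir/page.jsp?param=value' for the query-only case, not '/dir/?param=value'.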

> Hey,
> 
> this particular rule is commented out in my regex-normalize.xml (in yours
> too, btw). I think it would only match index or default pages, as you
> said. That is not the case in my example anyway.
> 
> I guess it has something to do with Nutch's link grabber, which
> misinterprets some special cases?
> 
> Cheers
> 
> On 28.06.11 12:32, Marek Bachmann wrote:
> > I really don't know if I am right, but in my opinion that could
> > happen because of the substitution:
> > 
> > <!-- changes default pages into standard for /index.html, etc. into /
> > <regex>
> >   <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
> >   <substitution>/$3</substitution>
> > </regex> -->
> > 
> > which is in the regex-normalize.xml file by default.
> > 
> > But, as I get it, this should only happen with page names that
> > begin with index or default.
> > 
> > Hope this was right =)
> > 
> > Cheers
> > 
> > On 28.06.2011 11:49, Matthias Naber wrote:
> > 
> > Hey all,
> > 
> > has anyone tried crawling pages with URL parameters? I got stuck on
> > a page (let's call it '/dir/page.jsp') that contains a link like
> > 
> > <a href="?param=value">my link text</a>
> > 
> > In the browser everything works fine. Clicking the link opens
> > the URL '/dir/page.jsp?param=value'.
> > 
> > But the Nutch crawler interprets this link differently.
> > Nutch's result looks like '/dir/?param=value'. So it resolves the
> > href target against the current directory instead of appending
> > it to the current page, as a browser would.
> > 
> > So the question is: who is wrong, all the browsers or the Nutch
> > crawler/link interpreter? :)
> > 
> > Cheers, mana
