Hi all,

On Jun 28, 2011, at 4:39am, Markus Jelsma wrote:

> I've seen trouble before with relative URLs. Perhaps we should improve our 
> unit tests to cover these cases.

There's a known bug in how the Java URL class resolves query-only relative 
links like this one (an href that starts with '?').

Tika has a work-around, so I'm curious if Matthias ran into this using the 
built-in HTML parser or the Tika version.
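
For illustration, here is a minimal Java sketch of the kind of guard that 
avoids the problem. It is not the actual Tika or Nutch code; the class name 
QueryOnlyLinks, the resolve() helper and the example.com URL are made up for 
this example. The idea is to put the base URL's last path segment back in 
front of a query-only target before handing it to java.net.URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class QueryOnlyLinks {

    // Hypothetical helper (an assumption for this sketch, not a library API).
    // Resolves a link target against the URL of the page it was found on.
    static URL resolve(URL base, String target) throws MalformedURLException {
        if (target.startsWith("?")) {
            String path = base.getPath();
            // Last path segment of the base, e.g. "page.jsp" for "/dir/page.jsp".
            // If the path ends in "/", this is the empty string.
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            // Turn "?param=value" into "page.jsp?param=value".
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://example.com/dir/page.jsp");
        // Plain java.net.URL resolution; per this thread it drops "page.jsp".
        System.out.println(new URL(base, "?param=value"));
        // Resolution with the guard above, which keeps the page name.
        System.out.println(resolve(base, "?param=value"));
    }
}
```

Targets that do not start with '?' are passed through unchanged, so the 
guard only affects the query-only case discussed here.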

-- Ken

> 
>> Hey,
>> 
>> that particular rule is commented out in my regex-normalize.xml (and in
>> yours too, by the way). I think it would only match index or default
>> pages, as you said. That is not the case in my example anyway.
>> 
>> I guess it has something to do with Nutch's link grabber, which
>> misinterprets some special cases?
>> 
>> Cheers
>> 
>> On 28.06.11 12:32, Marek Bachmann wrote:
>>> I really don't know if I am right, but in my opinion that could
>>> happen because of the substitution:
>>> 
>>> <!-- changes default pages into standard for /index.html, etc. into /
>>> <regex>
>>>   <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
>>>   <substitution>/$3</substitution>
>>> </regex> -->
>>> 
>>> which is in the regex-normalize.xml file by default.
>>> 
>>> But, as I get it, this should only happen with page names that
>>> begin with index or default.
>>> 
>>> Hope this was right =)
>>> 
>>> Cheers
>>> 
>>> On 28.06.2011 11:49, Matthias Naber wrote:
>>> 
>>> Hey all,
>>> 
>>> has anyone tried crawling pages with URL parameters? I got stuck on
>>> a page (let's call it '/dir/page.jsp') containing a link like
>>> 
>>> <a href="?param=value">my link text</a>
>>> 
>>> In the browser everything works fine. Clicking the link opens
>>> the URL '/dir/page.jsp?param=value'.
>>> 
>>> But the Nutch crawler interprets this link differently.
>>> Nutch's result looks like '/dir/?param=value'. So it tries to
>>> open the href target in the current directory instead of appending
>>> the target to the current page, as a browser would.
>>> 
>>> So the question is: who is wrong, all the browsers or the Nutch
>>> crawler/link interpreter? :)
>>> 
>>> Cheers, mana

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions