Hey,

I used the built-in parse-html parser. 

I only modified regex-normalize.xml and regex-urlfilter.txt so that URLs with 
parameters are also fetched (I allowed the '?' and '=' characters in URLs). 
The modified line is:
# regex-urlfilter.txt
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
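To illustrate the effect of the change, here is a minimal Python sketch (not Nutch code), assuming the stock rule is `-[?*!@=]`; the URL and function names are just for illustration:

```python
import re

# Stock Nutch rule rejects URLs containing '?', '*', '!', '@' or '=';
# the modified rule rejects only '*', '!' and '@'.
DEFAULT_RULE = re.compile(r"[?*!@=]")
MODIFIED_RULE = re.compile(r"[*!@]")

def accepted(rule, url):
    # A '-' rule in regex-urlfilter.txt rejects any URL it matches.
    return rule.search(url) is None

url = "http://example.com/dir/page.jsp?param=value"
print(accepted(DEFAULT_RULE, url))   # False: rejected by the stock filter
print(accepted(MODIFIED_RULE, url))  # True: accepted after the change
```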

Another bug I already fixed (with the handy regex-normalizer.xml) is the 
masking of special characters like the '&' in parameters, which are mishandled 
by the URL grabber. By default the '&' explodes into long cascades of 
"&amp;", "&amp;amp;", "&amp;amp;amp;", and so on. 

Lines like the following are missing: 
<!-- regex-normalize.xml fix faulty char encoding-->
<regex>
  <pattern>&amp;amp;</pattern>
  <substitution>&amp;</substitution>
</regex>

I think this would not be too hard to reproduce.
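For what it's worth, RFC 3986 agrees with the browsers on the relative-link question discussed below: a reference consisting only of a query keeps the base path. A quick check with Python's urllib (illustrative only; not what Nutch uses internally):

```python
from urllib.parse import urljoin

base = "http://example.com/dir/page.jsp"
# RFC 3986 section 5.3: a reference with an empty path and a query
# keeps the base path and replaces only the query component.
print(urljoin(base, "?param=value"))
# http://example.com/dir/page.jsp?param=value
```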

Cheers,
mana

On 28.06.2011 at 17:33, Ken Krugler wrote:

> Hi all,
> 
> On Jun 28, 2011, at 4:39am, Markus Jelsma wrote:
> 
>> I've seen trouble before with relative URLs. Perhaps we should improve our 
>> unit tests to incorporate these cases.
> 
> There's a known bug with how the Java URL class handles relative links like 
> this.
> 
> Tika has a work-around, so I'm curious if Matthias ran into this using the 
> built-in HTML parser or the Tika version.
> 
> -- Ken
> 
>> 
>>> Hey,
>>> 
>>> this particular line is commented in my regex-normalize.xml (in yours
>>> too btw). I think it would only match index or default pages as you
>>> said. This is not the case in my example anyway.
>>> 
>>> I guess it has something to do with the link grabber of Nutch, which
>>> misinterprets some special cases?
>>> 
>>> Cheers
>>> 
On 28.06.11 at 12:32, Marek Bachmann wrote:
>>>> I really don't know if I am right, but in my opinion that could
>>>> happen because of the substitution:
>>>> 
>>>> <!-- changes default pages into standard for /index.html, etc. into /
>>>> <regex>
>>>> <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
>>>> <substitution>/$3</substitution>
>>>> </regex> -->
>>>> 
>>>> which is in the regex-normalize.xml file by default.
>>>> 
>>>> But as I understand it, this should only happen with page names that
>>>> begin with index or default.
>>>> 
>>>> Hope this was right =)
>>>> 
>>>> Cheers
>>>> 
>>>> On 28.06.2011 11:49, Matthias Naber wrote:
>>>> 
>>>> Hey all,
>>>> 
>>>> has anyone tried crawling pages with URL parameters? I got stuck on
>>>> a page (let's call it '/dir/page.jsp') containing a link like
>>>> 
>>>> <a href="?param=value">my link text</a>
>>>> 
>>>> In the browser everything works fine. Clicking the link opens the
>>>> URL '/dir/page.jsp?param=value'.
>>>> 
>>>> But the Nutch crawler interprets this link differently. Nutch's
>>>> result looks like '/dir/?param=value'. So it tries to open the
>>>> href target in the current directory instead of appending the
>>>> target to the current page, as a browser would.
>>>> 
>>>> So the question is: who is wrong, all the browsers or the Nutch
>>>> crawler/link interpreter? :)
>>>> 
>>>> Cheers, mana
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom data mining solutions
