Hello,
With NUTCH-1233 we are going to rely on Tika for outlink extraction. It works nicely except for one small issue: consecutive whitespace in an anchor is not collapsed to a single space. The anchor text is identical to the HTML source and can contain surplus spaces, newlines or tabs:
<a> i am an anchor \n\t\t bla bla</a> does not become "i am an anchor bla bla". Is this a feature of Tika or a bug? Do we have to remove the surplus whitespace in Nutch ourselves?
Thanks!
Markus
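P.S. In case we do end up handling this on the Nutch side, here is a minimal sketch of what I have in mind. The class and method names (AnchorNormalizer, normalize) are made up for illustration and are not existing Nutch or Tika API; it only assumes the anchor arrives as a plain String.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Tika.
class AnchorNormalizer {
    // One or more whitespace characters: space, tab, newline, etc.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Collapses every whitespace run to a single space and trims both ends.
    static String normalize(String anchor) {
        if (anchor == null) {
            return "";
        }
        return WHITESPACE.matcher(anchor).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = " i am an anchor \n\t\t bla bla";
        System.out.println(normalize(raw)); // prints "i am an anchor bla bla"
    }
}
```

Note that \s in Java's default mode does not match the non-breaking space (U+00A0), so anchors containing &nbsp; would need extra handling.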
