Hello,

With NUTCH-1233 we are going to rely on Tika for outlink extraction. It works
nicely except for one small issue: consecutive whitespace in an anchor is not
collapsed to a single space. The anchor text is identical to the HTML
source and can contain surplus spaces, newlines or tabs:

<a>    i am an anchor             \n\t\t bla bla</a> does not become "i am an 
anchor bla bla".

Is this a feature of Tika or a bug? Do we have to remove surplus whitespace in
Nutch ourselves?
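
If it turns out we have to do it on the Nutch side, a minimal sketch of the
normalization I have in mind could look like the following. The class and
method names are hypothetical and are not existing Nutch or Tika API. It
assumes the anchor text is available as a plain String, and it only handles
what Java's \s matches, so a non-breaking space would survive.

```java
import java.util.regex.Pattern;

public class AnchorWhitespace {
    // Hypothetical helper for illustration only, not part of Nutch or Tika.
    // Matches one or more whitespace characters (space, tab, newline, ...).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Replaces every run of whitespace with a single space and strips
    // leading and trailing whitespace. A null anchor becomes "".
    static String collapse(String anchor) {
        if (anchor == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(anchor).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String raw = "    i am an anchor             \n\t\t bla bla";
        System.out.println(collapse(raw)); // prints "i am an anchor bla bla"
    }
}
```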

Thanks!
Markus
