Hi,

On Thu, Jul 5, 2012 at 7:51 PM, Markus Jelsma <[email protected]> wrote:
> Is this a feature of Tika or a bug?

It's a feature, at least until someone comes up with a compelling enough rationale for why anchor text should be handled differently.

Note that deciding what to do with cases like "foo<a> bar</a>" or "foo<a>bar</a>" can be quite tricky. A client like an indexer that simply ignores all markup should ideally see those as "foo bar" and "foobar" respectively. It may be difficult to make a parser implementation that normalizes whitespace in and around anchors work correctly in all such cases.

> Do we have to remove surplus whitespace in Nutch ourselves?

I think that's the easiest solution here.

BR,

Jukka Zitting
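
For illustration only, client-side cleanup along these lines could look roughly like the sketch below. It is written in Python with the standard library rather than against Tika or Nutch, and the names `TextCollector` and `extract_text` are made up for this example. It drops all markup without inserting anything at tag boundaries, then collapses runs of whitespace afterwards, so the two anchor cases above come out as "foo bar" and "foobar".

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps character data and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Keep the text exactly as it appears; nothing is added at tag
        # boundaries, so "foo<a>bar</a>" stays "foobar".
        self.chunks.append(data)


def extract_text(markup):
    """Strip markup, then collapse surplus whitespace on the client side."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.chunks)
    # Replace every run of whitespace with one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    for sample in ("foo<a> bar</a>", "foo<a>bar</a>", "foo  <a>  bar</a>\n baz"):
        print(repr(extract_text(sample)))
```

Doing the collapsing as a separate last step keeps the decision about word boundaries out of the parser, which is the part that is hard to get right in general.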
