Hi,

On Thu, Jul 5, 2012 at 7:51 PM, Markus Jelsma
<[email protected]> wrote:
> Is this a feature of Tika or a bug?

It's a feature, at least until someone comes up with a sufficiently
compelling rationale for handling anchor text differently.

Note that deciding what to do with cases like "foo<a> bar</a>" or
"foo<a>bar</a>" can be quite tricky. A client like an indexer that
simply ignores all markup should ideally see those as "foo bar" and
"foobar" respectively. It may be difficult to write a parser that
normalizes whitespace in and around anchors and still handles all
such cases correctly.

> Do we have to remove surplus whitespace in Nutch ourselves?

I think that's the easiest solution here.
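
As a rough illustration of what such client-side cleanup could look
like, here is a minimal Java sketch. The class and method names are
hypothetical and are not part of Nutch or Tika. It assumes the anchor
text is already available as a plain String, and it only collapses
runs of whitespace and trims the ends.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an existing Nutch or Tika class.
final class WhitespaceNormalizer {

    // One or more whitespace characters (space, tab, newline, etc.).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private WhitespaceNormalizer() {
    }

    // Collapses each run of whitespace into a single space and
    // removes leading and trailing whitespace.
    static String normalize(String text) {
        if (text == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }
}
```

Note that this never inserts whitespace, so "foobar" stays "foobar"
and "foo bar" stays "foo bar", which matches the two cases above.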

BR,

Jukka Zitting
