Yes, it makes sense. We'll collapse it in Nutch.

Thanks,
Markus
-----Original message-----
> From: Jukka Zitting <[email protected]>
> Sent: Mon 09-Jul-2012 17:17
> To: [email protected]
> Subject: Re: Surpluss whitespace in outlink anchors not collapsed
>
> Hi,
>
> On Thu, Jul 5, 2012 at 7:51 PM, Markus Jelsma
> <[email protected]> wrote:
> > Is this a feature of Tika or a bug?
>
> It's a feature at least until someone comes up with a compelling
> enough rationale why anchor text should be handled differently.
>
> Note that deciding what to do with cases like "foo<a> bar</a>" or
> "foo<a>bar</a>" can be quite tricky. A client like an indexer that
> simply ignores all markup should ideally see those as "foo bar" and
> "foobar" respectively. It may be difficult to make a parser
> implementation that normalizes whitespace in and around anchors work
> correctly in all such cases.
>
> > Do we have to remove surplus whitespace in Nutch ourselves?
>
> I think that's the easiest solution here.
>
> BR,
>
> Jukka Zitting
>
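
For reference, a minimal sketch of what collapsing on the Nutch side could look like. The class and method names (AnchorText, collapseWhitespace) are made up for illustration and are not existing Nutch or Tika API. The sketch assumes the cleanup is applied only to the anchor string kept with an outlink, not to the page text, so the "foo bar" versus "foobar" distinction in the body text that Jukka mentions is left alone.

```java
// Hypothetical helper, not part of Nutch or Tika.
final class AnchorText {

    private AnchorText() {
        // static utility, no instances
    }

    /**
     * Collapses every run of whitespace in an outlink anchor to a single
     * space and strips leading and trailing whitespace.
     *
     * Note: without the UNICODE_CHARACTER_CLASS flag, \s matches only
     * ASCII whitespace (space, tab, newline, vertical tab, form feed,
     * carriage return), so a non-breaking space (U+00A0) is left as-is.
     */
    static String collapseWhitespace(String anchor) {
        if (anchor == null) {
            return "";
        }
        return anchor.replaceAll("\\s+", " ").trim();
    }
}
```

Calling AnchorText.collapseWhitespace on an anchor such as " Apache \n\t Nutch " would give "Apache Nutch". Whether non-breaking spaces should also be collapsed is a separate decision this sketch does not make.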
