Yes, it makes sense. We'll collapse it in Nutch.
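
A minimal sketch of what collapsing could look like on our side, assuming a
hypothetical helper (AnchorText and collapseWhitespace are illustrative names,
not existing Nutch or Tika API). It only collapses whitespace inside the anchor
text that has already been extracted, so it sidesteps the harder question of
whitespace around the anchor element:

```java
final class AnchorText {

    // Hypothetical helper, not part of Nutch or Tika.
    // Replaces every run of whitespace (spaces, tabs, newlines) with a
    // single space and strips leading and trailing whitespace.
    static String collapseWhitespace(String anchor) {
        if (anchor == null) {
            return "";
        }
        return anchor.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Brackets make leading/trailing whitespace visible in the output.
        System.out.println("[" + collapseWhitespace("  foo \n\t bar  ") + "]");
    }
}
```

Note that \s in java.util.regex does not match a non-breaking space (U+00A0)
by default, so anchors padded with &nbsp; would need extra handling.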

Thanks
Markus

-----Original message-----
> From:Jukka Zitting <[email protected]>
> Sent: Mon 09-Jul-2012 17:17
> To: [email protected]
> Subject: Re: Surpluss whitespace in outlink anchors not collapsed
> 
> Hi,
> 
> On Thu, Jul 5, 2012 at 7:51 PM, Markus Jelsma
> <[email protected]> wrote:
> > Is this a feature of Tika or a bug?
> 
> It's a feature at least until someone comes up with a compelling
> enough rationale why anchor text should be handled differently.
> 
> Note that deciding what to do with cases like "foo<a> bar</a>" or
> "foo<a>bar</a>" can be quite tricky. A client like an indexer that
> simply ignores all markup should ideally see those as "foo bar" and
> "foobar" respectively. It may be difficult to make a parser
> implementation that normalizes whitespace in and around anchors work
> correctly in all such cases.
> 
> > Do we have to remove surplus whitespace in Nutch ourselves?
> 
> I think that's the easiest solution here.
> 
> BR,
> 
> Jukka Zitting
> 
