Jukka Zitting wrote:
> Hi,
> Quick summary of the Tika discussions from yesterday's text analysis
> BOF at the ApacheCon EU. It's the next morning now, so I'm probably
> missing a lot of stuff...
One other thing we discussed: for some input formats (such as HTML) it
would make sense for Tika to produce output that can be mapped back to
the input. In other words, it should be possible (optionally) to find
out, for each character in the output text, where that character
originated in the input. This is useful, for example, for result
highlighting.
This may not be something for the early releases, but it would be good
if we could keep this option in mind when designing the interfaces.
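To make the idea a bit more concrete, here is a minimal sketch in Java
of what such a mapping could look like. None of this is an existing or
proposed Tika interface: the class OffsetMappedText and its methods are
made-up names for illustration, and the tag stripping is deliberately
naive (no entities, no comments, no '>' inside attribute values). The
only point is that the extracted text carries one input offset per
output character.

```java
import java.util.Arrays;

/** Hypothetical holder: extracted text plus, per output character, its input offset. */
final class OffsetMappedText {
    final String text;
    final int[] inputOffsets; // inputOffsets[i] is the input index of text.charAt(i)

    private OffsetMappedText(String text, int[] inputOffsets) {
        this.text = text;
        this.inputOffsets = inputOffsets;
    }

    /** Naive extraction: copy every character outside <...> and remember where it came from. */
    static OffsetMappedText fromMarkup(String input) {
        StringBuilder out = new StringBuilder();
        int[] offsets = new int[input.length()];
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                offsets[out.length()] = i; // origin of the character we are about to emit
                out.append(c);
            }
        }
        return new OffsetMappedText(out.toString(), Arrays.copyOf(offsets, out.length()));
    }

    /** Maps a non-empty output range [start, end) to the input range that covers it. */
    int[] toInputRange(int start, int end) {
        return new int[] { inputOffsets[start], inputOffsets[end - 1] + 1 };
    }

    public static void main(String[] args) {
        String input = "<p>Hello <b>world</b></p>";
        OffsetMappedText mapped = fromMarkup(input);
        int hitStart = mapped.text.indexOf("world"); // e.g. a search hit in the output
        int[] range = mapped.toInputRange(hitStart, hitStart + "world".length());
        System.out.println(mapped.text);
        System.out.println(input.substring(range[0], range[1]));
    }
}
```

With the input above, the extracted text is "Hello world", and the hit
"world" at output positions 6 to 11 maps back to input positions 12 to
17, which is the span a highlighter would wrap. A hit that crosses a tag
boundary maps to an input range that includes the tag, so a real
implementation would probably want to return several ranges instead.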
--Thilo