Hi,

On 5/3/07, Thilo Goetz wrote:
> One other thing that we discussed was that it would make sense for some
> input formats (such as html) if Tika could produce output that allows
> mapping back to the input.  In other words, it should be possible
> (optionally) to know for each character in the output text where this
> character originated in the input.  This is useful, for example, for
> result highlighting.

I think the best technical solution to this (assuming we use XHTML SAX
events) is to embed such backmapping information as namespaced
attributes in the output event stream. For example, a PDF document
could result in something like this:

   <html xmlns="...xhtml" xmlns:pdf="...tika-pdf-annotations">
     <head>...</head>
     <body>
       <h1 pdf:location="...">...</h1>
       <p pdf:location="...">...</p>
     </body>
   </html>

If more granularity is needed, the parser component could produce
extra <span> elements, for example one for each line or even each word
in the source document:

   ...<span pdf:location="...">...</span>...
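
As a rough sketch of what emitting such events could look like with
plain JAXP/SAX: the annotation namespace URI, the syntax of the
location value, and the class and method names below are all made up
for illustration, and none of this is an existing Tika API.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class LocationAnnotationSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    // Hypothetical annotation namespace; the real URI is not decided.
    static final String PDF = "urn:example:tika-pdf-annotations";

    // Emits <name pdf:location="...">text</name> into the event stream.
    static void element(ContentHandler out, String name,
                        String location, String text) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute(PDF, "location", "pdf:location", "CDATA", location);
        out.startElement(XHTML, name, name, atts);
        out.characters(text.toCharArray(), 0, text.length());
        out.endElement(XHTML, name, name);
    }

    // Serializes a small annotated document so the events can be inspected.
    static String render() throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler out = factory.newTransformerHandler();
        out.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        out.getTransformer().setOutputProperty(
            OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        out.setResult(new StreamResult(writer));

        out.startDocument();
        out.startPrefixMapping("", XHTML);
        out.startPrefixMapping("pdf", PDF);
        out.startElement(XHTML, "html", "html", new AttributesImpl());
        out.startElement(XHTML, "body", "body", new AttributesImpl());
        // The "page=...;offset=..." value format is purely illustrative.
        element(out, "h1", "page=1;offset=0", "Title");
        element(out, "p", "page=1;offset=42", "First paragraph.");
        out.endElement(XHTML, "body", "body");
        out.endElement(XHTML, "html", "html");
        out.endPrefixMapping("pdf");
        out.endPrefixMapping("");
        out.endDocument();
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render());
    }
}
```

The annotations travel as ordinary attributes, so the ContentHandler
interface itself would not need to change to carry them.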

> This may not be something for the early releases, but it would be good
> if we could keep this option in the back of our heads when designing the
> interfaces.

Agreed. I think a namespaced annotation mechanism like the one
suggested above would be an easy and forward-compatible way to add
such functionality.
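
To illustrate the forward-compatibility point, here is a small sketch
of two consumers of the same stream. One ignores attributes entirely
and still gets the plain text; the other looks up the annotation by
namespace and records where each annotated element starts in the
output text. The namespace URI, class names and location values are
again hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class LocationReaderSketch {
    // Hypothetical annotation namespace, matching whatever the parser emits.
    static final String PDF = "urn:example:tika-pdf-annotations";

    // Annotation-unaware consumer: collects text, never looks at attributes.
    static class TextOnly extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Annotation-aware consumer: also maps output offsets to source locations.
    static class WithLocations extends TextOnly {
        final List<String> spans = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            String location = atts.getValue(PDF, "location");
            if (location != null) {
                // Offset of this element's first character in the output text.
                spans.add(text.length() + " -> " + location);
            }
        }
    }

    static void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look attributes up by URI
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)), handler);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\""
            + " xmlns:pdf=\"" + PDF + "\"><body>"
            + "<h1 pdf:location=\"a\">Title</h1>"
            + "<p pdf:location=\"b\">Text</p>"
            + "</body></html>";
        WithLocations handler = new WithLocations();
        parse(xml, handler);
        System.out.println(handler.text + " " + handler.spans);
    }
}
```

A client that only wants text can keep using the first handler
unchanged after annotations are introduced.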

BR,

Jukka Zitting
