On 4/4/11 2:17 PM, Richard Eckart de Castilho wrote:
Hi Jörn,

what is the suggested way to detect a text sofa?

As far as I know the suggested way of doing it is via the mime type, right?

Which options remain when the mime type is not set? Is CAS.getDocumentText() != 
null appropriate?
In my opinion, a non-text SofA has getDocumentText() == null; the data would be 
acquired as a stream instead.
A text SofA might contain markup, which can be reflected by the mime type.
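
A minimal sketch of that check follows. It does not use the real UIMA classes: SofaView is a hypothetical stand-in interface that only borrows the accessor names getDocumentText() and getSofaMimeType(), and the list of markup mime types is an assumption made for illustration.

```java
import java.util.Locale;

/** Hypothetical stand-in for the two CAS accessors discussed here (not the UIMA API). */
interface SofaView {
    String getDocumentText(); // null when the SofA carries no text
    String getSofaMimeType(); // null when no mime type was set
}

final class TextSofaCheck {
    /** A SofA counts as text when it has document text, even if no mime type is set. */
    static boolean isTextSofa(SofaView view) {
        return view.getDocumentText() != null;
    }

    /** The mime type only hints at markup inside a text SofA (assumed list of types). */
    static boolean mayContainMarkup(SofaView view) {
        String mime = view.getSofaMimeType();
        if (!isTextSofa(view) || mime == null) {
            return false;
        }
        String m = mime.toLowerCase(Locale.ROOT);
        return m.equals("text/html") || m.equals("text/xml") || m.endsWith("+xml");
    }

    /** Small helper that builds a view for the demo below. */
    static SofaView view(String text, String mime) {
        return new SofaView() {
            public String getDocumentText() { return text; }
            public String getSofaMimeType() { return mime; }
        };
    }

    public static void main(String[] args) {
        System.out.println(isTextSofa(view("Hello", null)));                  // true
        System.out.println(isTextSofa(view(null, "audio/wav")));              // false
        System.out.println(mayContainMarkup(view("<p>Hi</p>", "text/html"))); // true
    }
}
```

With a real CAS, the same two questions would be asked of the view itself instead of the stand-in.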

If data is acquired using a stream, the mime type should probably be considered 
to decide whether the content can be rendered as text. However, the mapping from 
begin and end offsets to the actual character offsets might not be discernible 
from the mime type alone, for example if the stream returns HTML but the offsets 
refer to a plain-text-only "view".
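
To make the HTML case concrete, here is a small sketch in plain Java, with no UIMA involved. The sample string, the offsets 6 and 11, and the regex-based plainTextView helper are all made up for illustration; the helper is a naive tag stripper and not a real HTML parser.

```java
final class OffsetViewDemo {
    /** Naive tag stripper, for illustration only; not a real HTML parser. */
    static String plainTextView(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        String plain = plainTextView(html); // "Hello world"

        // Offsets that were computed against the plain-text view.
        int begin = 6;
        int end = 11;

        System.out.println(plain.substring(begin, end)); // world
        System.out.println(html.substring(begin, end));  // lo <b
    }
}
```

The same begin and end select the intended word in the plain-text view but an unrelated fragment of the raw HTML, and nothing in a mime type of text/html says which of the two the offsets were meant for.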
The begin and end features of uima.tcas.Annotation should only be used as 
offsets into the document text; otherwise a few methods might throw exceptions 
because of invalid text bounds.
The Annotation Editor inside the Cas Editor also builds upon this assumption.
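
A defensive check along those lines could look roughly like the sketch below. It is plain Java rather than UIMA code: String.substring stands in for the covered-text lookup, and the names CoveredTextGuard, hasValidBounds and coveredTextOrNull are hypothetical.

```java
final class CoveredTextGuard {
    /** True when begin and end are usable as offsets into the document text. */
    static boolean hasValidBounds(String documentText, int begin, int end) {
        return documentText != null
                && 0 <= begin
                && begin <= end
                && end <= documentText.length();
    }

    /** Returns the covered text, or null when the offsets do not fit the document text. */
    static String coveredTextOrNull(String documentText, int begin, int end) {
        return hasValidBounds(documentText, begin, end)
                ? documentText.substring(begin, end)
                : null;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(coveredTextOrNull(text, 6, 11)); // world
        System.out.println(coveredTextOrNull(text, 6, 40)); // null, end is past the text
        System.out.println(coveredTextOrNull(null, 0, 5));  // null, no document text at all

        // Without the guard, the bad offsets fail with an exception.
        try {
            text.substring(6, 40);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("invalid text bounds");
        }
    }
}
```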

Jörn
