Hi Jörn,

> what is the suggested way to detect a text sofa?
> 
> As far as I know the suggested way of doing it is via the mime type, right?
> 
> Which options remain when the mime type is not set? Is CAS.getDocumentText != 
> null appropriate ?

In my opinion, a non-text SofA has getDocumentText() == null; its data would be 
acquired as a stream instead.
A text SofA might contain markup, which can be reflected by the mime type.
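
As a rough sketch of the check I have in mind: the SofaView interface below is 
a hypothetical stand-in that only mirrors CAS.getDocumentText() and 
CAS.getSofaMimeType(), so the snippet compiles without UIMA on the classpath. 
The "text/" prefix test is my own assumption for illustration, not something 
UIMA prescribes.

```java
import java.util.Locale;

/** Hypothetical stand-in for the two CAS accessors discussed in this thread. */
interface SofaView {
    String getDocumentText();  // mirrors CAS.getDocumentText()
    String getSofaMimeType();  // mirrors CAS.getSofaMimeType()
}

final class TextSofaCheck {
    /** A view counts as a text SofA if it has document text at all. */
    static boolean isTextSofa(SofaView view) {
        return view.getDocumentText() != null;
    }

    /**
     * For non-text SofAs, fall back to the mime type as a hint.
     * Assumption: only "text/..." types are treated as renderable text.
     */
    static boolean mayRenderAsText(SofaView view) {
        if (isTextSofa(view)) {
            return true;
        }
        String mime = view.getSofaMimeType();
        return mime != null && mime.toLowerCase(Locale.ROOT).startsWith("text/");
    }

    /** Small factory for trying the checks without a real CAS. */
    static SofaView view(final String text, final String mime) {
        return new SofaView() {
            @Override public String getDocumentText() { return text; }
            @Override public String getSofaMimeType() { return mime; }
        };
    }
}
```

With a real CAS you would call the two accessors on the CAS view directly 
instead of going through the stand-in interface.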

If data is acquired using a stream, the mime type should probably be considered 
to decide if the content can be rendered as text. However, the mapping from 
begin and end offsets to the actual character offsets might not be discernible 
from the mime type alone, for example if the stream returns HTML but the 
offsets refer to a plain-text-only "view".
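
To illustrate the offset problem with made-up data (the HTML string, the 
offsets, and the covered() helper are all hypothetical): the same begin/end 
pair selects different characters depending on whether it is applied to the 
plain-text view or to the raw HTML.

```java
final class OffsetMismatch {
    /** Returns the characters covered by [begin, end) in the given data. */
    static String covered(String sofaData, int begin, int end) {
        return sofaData.substring(begin, end);
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>"; // what the stream returns
        String plain = "Hello world";              // plain-text-only "view"
        int begin = 6;
        int end = 11;                              // offsets computed on the plain text

        // Applied to the plain text, the offsets cover the intended word.
        System.out.println(covered(plain, begin, end));
        // Applied to the raw HTML, the same offsets land inside the markup.
        System.out.println(covered(html, begin, end));
    }
}
```

The mime type "text/html" alone would not tell a consumer which of the two 
strings the offsets were computed against.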

Cheers,

Richard

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
[email protected] 
www.ukp.tu-darmstadt.de 
------------------------------------------------------------------- 




