Hi Jörn,

> what is the suggested way to detect a text sofa?
>
> As far as I know the suggested way of doing it is via the mime type, right?
>
> Which options remain when the mime type is not set? Is CAS.getDocumentText !=
> null appropriate?

In my opinion, a non-text SofA has getDocumentText() == null; its data would be acquired as a stream instead. A text SofA might contain markup, which can be reflected by the MIME type.

If the data is acquired as a stream, the MIME type should probably be considered when deciding whether the content can be rendered as text. However, the mapping from begin and end offsets to the actual character offsets might not be discernible from the MIME type alone, for example if the stream returns HTML but the offsets refer to a plain-text-only "view".

Cheers,

Richard

-- 
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
[email protected]
www.ukp.tu-darmstadt.de
-------------------------------------------------------------------
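
A minimal sketch of the check described above, in Java. To keep it runnable without UIMA on the classpath, it uses a hypothetical SofaView interface that stands in for the two CAS accessors involved, getDocumentText() and getSofaMimeType(). The SofaTextCheck class, the Kind enum, and the "text/" prefix rule are assumptions made for illustration and are not part of UIMA.

```java
import java.util.Locale;

// Sketch only: SofaView is a hypothetical stand-in for a UIMA CAS view.
final class SofaTextCheck {

    /** Hypothetical abstraction mirroring two accessors of a CAS view. */
    interface SofaView {
        String getDocumentText();  // null when the SofA holds no text
        String getSofaMimeType();  // null when no MIME type was set
    }

    enum Kind { TEXT, STREAM_POSSIBLY_TEXT, NOT_TEXT_OR_UNKNOWN }

    static Kind classify(SofaView view) {
        // A text SofA exposes its content directly as a string.
        if (view.getDocumentText() != null) {
            return Kind.TEXT;
        }
        // Otherwise the data would come from a stream; fall back to the MIME type.
        // Assumption: a "text/" prefix means the content can be rendered as text.
        String mime = view.getSofaMimeType();
        if (mime != null && mime.trim().toLowerCase(Locale.ROOT).startsWith("text/")) {
            return Kind.STREAM_POSSIBLY_TEXT;
        }
        // No text and no usable MIME type: nothing more can be decided here.
        return Kind.NOT_TEXT_OR_UNKNOWN;
    }

    /** Small helper to build a view for experiments. */
    static SofaView view(final String text, final String mime) {
        return new SofaView() {
            @Override public String getDocumentText() { return text; }
            @Override public String getSofaMimeType() { return mime; }
        };
    }
}
```

Even when classify returns STREAM_POSSIBLY_TEXT, the begin and end offsets of annotations may not index into the streamed characters, as in the HTML case mentioned above, so the result only says whether the content could be rendered as text.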
