> I've been trying to generalize my code so that I'm not passing a text
> string directly into the CAS when I run UIMA. I would like to do this
> so that I can pass in file paths to Word docs or other types of files
> and have UIMA then extract the text within the engine.
>
UIMA doesn't automatically extract the data referenced by the Sofa
data URI (the one set with setSofaDataURI). The annotator has to change
the way it accesses the Sofa data (the text, in this case) by getting a
stream with

    InputStream inputStream = aCas.getSofaDataStream();

and then reading the data from inputStream.

Java has built-in URL handlers for file: and some other schemes (http:,
for example), but you'll have to be aware of character encoding issues:
the stream hands you raw bytes, so the annotator has to know which
encoding to use when turning them into text. The Sofa data stream will
also work for text put into the CAS using setDocumentText or
setSofaDataString; in that case the InputStream returns the text String
as bytes in UTF-8 encoding.
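
To make the encoding point concrete, here is a rough sketch of decoding
such a stream with an explicit charset. It uses only the JDK so it can
be run on its own. SofaText and readText are names I made up for
illustration, not part of UIMA, and the ByteArrayInputStream stands in
for what aCas.getSofaDataStream() would return when the text was set as
a String.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the UIMA API.
final class SofaText {

    // Drain the stream and decode its bytes with the charset the caller names.
    static String readText(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        String original = "na\u00efve caf\u00e9";

        // Stand-in for aCas.getSofaDataStream(): text that was put into the
        // CAS as a String comes back from the stream as UTF-8 bytes.
        try (InputStream in =
                new ByteArrayInputStream(original.getBytes(StandardCharsets.UTF_8))) {
            String text = readText(in, StandardCharsets.UTF_8);
            System.out.println(text.equals(original)); // prints "true"
        }
    }
}
```

Inside an annotator the call would presumably look like
SofaText.readText(aCas.getSofaDataStream(), StandardCharsets.UTF_8),
with the charset swapped for whatever the referenced file really uses.
This only covers plain-text files.
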
Eddie