I updated the UIMA website's sandbox page with this information. -Marshall
Julien Nioche wrote:
> Hi Marshall,
>
> There is a description in the README.txt file from the TikaAnnotator
> repository, which I have slightly rewritten into the text below.
>
>
> *Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries. The TikaAnnotator uses Tika to generate annotations representing
> the original markup of a document and to extract its text and metadata. It
> consists of three resources:
>
> - FileSystemCollectionReader: similar to the one in the UIMA examples, but
> uses Tika to extract the text from binary documents and generates
> annotations to represent the markup
>
> - MarkupAnnotator: takes the original content from a view and generates a
> new view containing the extracted text with markup annotations
>
> - TikaWrapper: utility class that populates a CAS from a binary
> document; used by the FileSystemCollectionReader*
>
>
> Best,
>
> J.
