Re: document structure

Marshall Schor Fri, 29 May 2009 14:03:27 -0700

I updated the UIMA website's sandbox page with this information.

-Marshall


Julien Nioche wrote:
> Hi Marshall,
>
> There is a description in the README.txt file from the TikaAnnotator
> repository, which I have slightly rewritten into the text below.
>
>
> *Apache Tika is a toolkit for detecting and extracting metadata and
> structured text content from various documents using existing parser
> libraries. The TikaAnnotator uses Tika to extract a document's text and
> metadata and to generate annotations representing its original markup. It
> consists of three resources:
>
> - FileSystemCollectionReader : similar to the one in the UIMA examples, but
> uses Tika to extract the text from binary documents and generates
> annotations to represent the markup
>
> - MarkupAnnotator : takes the original content from a view and generates a
> new view containing the extracted text with markup annotations
>
> - TikaWrapper : utility class for populating a CAS from a binary
> document; used by the FileSystemCollectionReader *
>
>
> Best,
>
> J.
>
>
