On 6/25/07, Arthit Suriyawongkul <[EMAIL PROTECTED]> wrote:
Hi,

How does UIMA load a document into memory?
Does it load the whole document at once, or does it read the document
partially (e.g. in a stream-like fashion)?

I'm currently using GATE and sometimes run into problems when my document
is very large, because GATE tries to load the whole document into memory
first and convert it to its own representation.
My application doesn't need knowledge of the whole document (as with DOM);
it only takes data from a small window (e.g. fewer than 100 characters)
at a time.

cheers,
Art


Hi Art,

UIMA is flexible with respect to this.  You can provide a
CollectionReader that populates a CAS with however much text is
appropriate for your application.  So a single document could be split
across many CASes in order to decrease the overall memory
requirements.
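
To make the chunking idea concrete, here is a minimal sketch in plain
Java. It does not use the UIMA API; the class name ChunkedDocumentReader
and the fixed chunk size are made up for illustration. Each string it
returns is the amount of text a CollectionReader could put into one CAS.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of UIMA: hands out a large document
// one fixed-size chunk at a time so only one chunk is held in memory.
class ChunkedDocumentReader {
    private final Reader source;
    private final char[] buffer;

    ChunkedDocumentReader(Reader source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.source = source;
        this.buffer = new char[chunkSize];
    }

    // Returns the next chunk, or null once the source is exhausted.
    String nextChunk() throws IOException {
        int filled = 0;
        // A single read() may return fewer chars than requested,
        // so keep reading until the buffer is full or the stream ends.
        while (filled < buffer.length) {
            int n = source.read(buffer, filled, buffer.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        return filled == 0 ? null : new String(buffer, 0, filled);
    }

    public static void main(String[] args) throws IOException {
        ChunkedDocumentReader reader =
                new ChunkedDocumentReader(new StringReader("abcdefghij"), 4);
        for (String c = reader.nextChunk(); c != null; c = reader.nextChunk()) {
            System.out.println(c); // in UIMA, this text would go into one CAS
        }
    }
}
```

Note that a fixed-size cut can land in the middle of a word, so a real
reader would probably cut at a sentence or paragraph boundary instead.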

It's also possible to split a CAS into smaller CASes, do annotation on
each, and then merge the results.  The kind of component that does the
split and merge is called a "CAS Multiplier".  There's an example of
this in the uimaj-examples project that comes with the download - see
descriptors/cas_multiplier/Segment_Annotate_Merge_AE.  This is
described in the "CAS Multiplier Developer's Guide" section of the
documentation.
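
The segment / annotate / merge pattern itself can be sketched without
UIMA. The plain-Java example below is only an illustration under
assumptions of my own: the class SegmentAnnotateMerge, the Span type,
the line-based segmentation, and the toy "number" annotator are all
hypothetical and are not taken from the uimaj-examples code. The point
it shows is the merge step: offsets found inside a segment are shifted
by the segment's start so they refer to the original document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of segment -> annotate -> merge.
class SegmentAnnotateMerge {

    // A [begin, end) character span; a stand-in for an annotation.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Toy annotator: marks runs of digits, with offsets local to the segment.
    static List<Span> annotate(String segment) {
        List<Span> spans = new ArrayList<>();
        Matcher m = NUMBER.matcher(segment);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end()));
        }
        return spans;
    }

    // Splits the document into lines, annotates each line on its own,
    // and merges the results back into document-level offsets.
    static List<Span> process(String document) {
        List<Span> merged = new ArrayList<>();
        int start = 0;
        while (start < document.length()) {
            int newline = document.indexOf('\n', start);
            int end = (newline < 0) ? document.length() : newline + 1;
            String segment = document.substring(start, end);
            for (Span local : annotate(segment)) {
                // Shift segment-local offsets by the segment's start.
                merged.add(new Span(local.begin + start, local.end + start));
            }
            start = end;
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(process("a 12\nbc 345\n")); // prints [[2,4), [8,11)]
    }
}
```

For brevity the sketch keeps the whole document in one String; in a real
pipeline each segment would live in its own CAS and only the merged
annotations would need to be kept.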

Another option is to consider using a "remote Sofa" (Sofa = subject of
analysis).  In this case the CAS just contains a URL to where the
actual document lives, not the document text itself.  See the
"Annotations, Artifacts, and Sofas" section of the documentation.
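
The idea of holding a reference instead of the text can be sketched in
plain Java as follows. This is not the UIMA Sofa API; RemoteDocumentRef
and readWindow are hypothetical names, and UTF-8 encoding is assumed.
The object stores only a URI and streams a small window of characters
on demand, which matches the "less than 100 characters at a time" use
case.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a remote Sofa: keeps a URI, not the text.
class RemoteDocumentRef {
    private final URI location;

    RemoteDocumentRef(URI location) {
        this.location = location;
    }

    // Streams the document and returns up to windowSize characters
    // starting at character offset skipChars. Assumes UTF-8 content.
    String readWindow(long skipChars, int windowSize) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(
                location.toURL().openStream(), StandardCharsets.UTF_8))) {
            long remaining = skipChars;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    break; // reached the end before the requested offset
                }
                remaining -= skipped;
            }
            char[] window = new char[windowSize];
            int filled = 0;
            while (filled < windowSize) {
                int n = in.read(window, filled, windowSize - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
            return new String(window, 0, filled);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sofa", ".txt");
        Files.write(tmp, "hello remote sofa".getBytes(StandardCharsets.UTF_8));
        RemoteDocumentRef ref = new RemoteDocumentRef(tmp.toUri());
        System.out.println(ref.readWindow(6, 6)); // prints remote
        Files.delete(tmp);
    }
}
```

Reopening the stream for every window is wasteful for sequential access;
it is done here only to keep the sketch short.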

-Adam
