On 6/25/07, Arthit Suriyawongkul <[EMAIL PROTECTED]> wrote:
Hi,
How does UIMA load a document into memory?
Does it load the whole document at once, or does it read the document
only partially (sometimes stream-like)?
I'm currently using GATE and sometimes run into problems when a
document is very large, because GATE tries to load the whole document
into memory first and convert it to its own representation.
My application doesn't need knowledge of the whole document (like a DOM);
it only takes data from a small window (e.g. fewer than 100
characters) at a time.
cheers,
Art
Hi Art,
UIMA is flexible with respect to this. You can provide a
CollectionReader that populates a CAS with however much text is
appropriate for your application. So a single document could be split
across many CASes in order to decrease the overall memory
requirements.
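
To make the chunking idea concrete, here is a minimal plain-Java sketch
of the reading loop such a reader might use. It is not UIMA API: the
class name ChunkedReaderSketch, its methods, and the chunk size are all
hypothetical, and a real CollectionReader would put each chunk into a
CAS instead of collecting them in a list.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (not UIMA API): cut a character stream into bounded chunks.
final class ChunkedReaderSketch {

    // Reads up to chunkSize characters; returns null once the input is exhausted.
    static String nextChunk(Reader in, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] buf = new char[chunkSize];
        int filled = 0;
        try {
            while (filled < chunkSize) {
                int n = in.read(buf, filled, chunkSize - filled);
                if (n < 0) {
                    break; // end of stream
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return filled == 0 ? null : new String(buf, 0, filled);
    }

    // Collects every chunk for demonstration purposes only; a real reader
    // would hand each chunk on for analysis and then let it go.
    static List<String> readAll(Reader in, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        String chunk;
        while ((chunk = nextChunk(in, chunkSize)) != null) {
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String chunk : readAll(new StringReader("abcdefghij"), 4)) {
            System.out.println(chunk);
        }
    }
}
```

Fixed-size chunks can cut through a window of interest at a chunk
boundary, so an application that looks at roughly 100 characters at a
time would probably want neighbouring chunks to overlap by at least
that much.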
It's also possible to split a CAS into smaller CASes, do annotation on
each, and then merge the results. The kind of component that does the
split and merge is called a "CAS Multiplier". There's an example of
this in the uimaj-examples project that comes with the download - see
descriptors/cas_multiplier/Segment_Annotate_Merge_AE. This is
described in the "CAS Multiplier Developer's Guide" section of the
documentation.
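
The split / annotate / merge pattern itself can be sketched without any
UIMA classes. The code below is only an illustration of the idea, not
the shipped example: the names SegmentAnnotateMergeSketch, Segment and
Span are hypothetical, segments are simply lines, and the "annotator"
just marks runs of digits. The point to notice is the merge step, which
shifts each segment-local offset by the segment's position in the
original document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch (not UIMA API) of split, annotate, merge.
final class SegmentAnnotateMergeSketch {

    // A span of text, as [begin, end) character offsets.
    record Span(int begin, int end) {}

    // A piece of the document plus its start offset in the whole document.
    record Segment(int offset, String text) {}

    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Split: one segment per line, keeping the trailing newline in the segment.
    static List<Segment> split(String doc) {
        List<Segment> segments = new ArrayList<>();
        int start = 0;
        while (start < doc.length()) {
            int nl = doc.indexOf('\n', start);
            int end = (nl < 0) ? doc.length() : nl + 1;
            segments.add(new Segment(start, doc.substring(start, end)));
            start = end;
        }
        return segments;
    }

    // Annotate: a stand-in annotator that sees one segment at a time and
    // returns offsets relative to that segment.
    static List<Span> annotate(String segmentText) {
        List<Span> spans = new ArrayList<>();
        Matcher m = NUMBER.matcher(segmentText);
        while (m.find()) {
            spans.add(new Span(m.start(), m.end()));
        }
        return spans;
    }

    // Merge: translate segment-local offsets back to document offsets.
    static List<Span> annotateBySegments(String doc) {
        List<Span> merged = new ArrayList<>();
        for (Segment seg : split(doc)) {
            for (Span local : annotate(seg.text())) {
                merged.add(new Span(seg.offset() + local.begin(),
                                    seg.offset() + local.end()));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String doc = "a 12 b\nc 345\n";
        for (Span s : annotateBySegments(doc)) {
            System.out.println(s + " covers " + doc.substring(s.begin(), s.end()));
        }
    }
}
```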
Another option is to consider using a "remote Sofa" (Sofa = subject of
analysis). In this case the CAS just contains a URL to where the
actual document lives, not the document text itself. See the
"Annotations, Artifacts, and Sofas" section of the documentation.
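
As a rough illustration of why a reference can be cheaper than the text
itself, the plain-Java sketch below keeps only a URI and opens a stream
when a window of characters is actually needed. It does not use UIMA;
the class RemoteTextSketch and its methods are hypothetical. In UIMA
itself the corresponding CAS calls are, as far as I recall,
setSofaDataURI and getSofaDataStream, but please check the
documentation section mentioned above before relying on that.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain-Java sketch (not UIMA API): hold a reference to the text, not the text.
final class RemoteTextSketch {
    private final URI location;

    RemoteTextSketch(URI location) {
        this.location = location;
    }

    // Returns up to `length` characters starting at character `start`,
    // assuming the remote content is UTF-8. Only the requested window is
    // kept in memory; the characters before it are read and discarded.
    String readWindow(long start, int length) {
        try (Reader in = new InputStreamReader(
                location.toURL().openStream(), StandardCharsets.UTF_8)) {
            long toSkip = start;
            while (toSkip > 0) {
                long n = in.skip(toSkip);
                if (n <= 0) {
                    return ""; // document is shorter than `start`
                }
                toSkip -= n;
            }
            char[] buf = new char[length];
            int filled = 0;
            while (filled < length) {
                int n = in.read(buf, filled, length - filled);
                if (n < 0) {
                    break; // end of document
                }
                filled += n;
            }
            return new String(buf, 0, filled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes a small temporary file and reads a window from it.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("remote-text", ".txt");
            try {
                Files.writeString(tmp, "0123456789abcdef");
                return new RemoteTextSketch(tmp.toUri()).readWindow(10, 3);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```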
-Adam