Hi Yaakov,

> I wanted to find out if UIMA has any concept of content segmentation.
> Some of the analysis processing is very memory and CPU intensive and
> if the content happens to be huge (like a book), it will bring the
> server to a crawl.
>
> So, I was wondering if the UIMA framework has any notion of breaking
> up the content into smaller segments.

Content segmentation is a core concept in UIMA, with each CAS typically
considered to contain an "artifact" to be analyzed. Something has to
segment the input corpus into discrete artifacts. In the most common
scenario, a "collection reader" at the front of the UIMA pipeline
segments the input and initializes each CAS. For other scenarios, the
"CAS Multiplier", a more general segmentation component, is used to
initialize CASes.

A CAS Multiplier (CM) can be called at any point in a UIMA pipeline;
indeed, multiple CM components can be used in the same pipeline.
Consider a scenario where a CM is given an input CAS with a pointer to
a large audio file. The CM could read the audio file, segment at
boundaries appropriate for subsequent analysis, and create new CASes
with just the audio content for each segment.

Note that the artifact to be analyzed, called the Subject of analysis
(Sofa), does not have to reside in the CAS itself. UIMA supports the
notion of "remote Sofas", represented in the CAS by a URI. UIMA also
provides stream access methods for remote Sofa content, which in Java
simply map to URI stream reading.

Hoping this actually addresses your question,
Eddie
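
P.S. In case a concrete picture helps, here is a minimal sketch, in
plain Java with no UIMA dependency, of the kind of segmentation a CAS
Multiplier might perform on one large text document. The class name
TextSegmenter, the maxChars limit, and the cut-at-whitespace rule are
all made up for illustration and are not part of UIMA. A real CAS
Multiplier would create a new CAS for each segment instead of returning
strings, and would choose boundaries that suit the downstream analysis.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper (not a UIMA class): yields bounded segments of one large text. */
class TextSegmenter implements Iterator<String> {
    private final String text;
    private final int maxChars;
    private int pos = 0; // start offset of the next segment

    TextSegmenter(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.text = text;
        this.maxChars = maxChars;
    }

    /** True while some part of the input has not been emitted yet. */
    @Override
    public boolean hasNext() {
        return pos < text.length();
    }

    /** Returns the next segment, at most maxChars long. */
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = Math.min(pos + maxChars, text.length());
        if (end < text.length()) {
            // Prefer to cut just after the last whitespace in the window,
            // so that words are not split across segments.
            int ws = lastWhitespace(pos, end);
            if (ws > pos) {
                end = ws + 1;
            }
        }
        String segment = text.substring(pos, end);
        pos = end;
        return segment;
    }

    /** Index of the last whitespace character in [from, to), or -1 if none. */
    private int lastWhitespace(int from, int to) {
        for (int i = to - 1; i >= from; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    /** Convenience wrapper: collect every segment into a list. */
    static List<String> segmentAll(String text, int maxChars) {
        List<String> out = new ArrayList<>();
        TextSegmenter segmenter = new TextSegmenter(text, maxChars);
        while (segmenter.hasNext()) {
            out.add(segmenter.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each printed line stands in for one new CAS handed to the rest of the pipeline.
        for (String segment : segmentAll("a very long book would go here", 12)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

The segments are produced one at a time through hasNext()/next(), so
only one segment has to be analyzed at once, which is the point of
breaking up a book-sized input. Concatenating the segments gives back
the original text, so offsets can be mapped to the source document by
keeping a running total of segment lengths.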
