Olivier--

I can't comment on the mechanics of CAS merging that you outline below, but two 
thoughts occur to me.

1. What's the motivation for merging?  For example, if one is going to put the 
data into a system whose purpose is retrieving documents (index into a 
full-text index or insert into a database), then the user may not even want the 
entire document back as a result.  In other words, examine the assumption that 
the unit of retrieval should be the entire document.  It may be more useful to 
return some natural sub-unit, such as a chapter or section.  If the user gets 
back something huge, then he just has another searching task to find the 
information somewhere in the 500 pages returned.

As a side benefit, the linguistic analysis may do a better job when limited to a 
natural sub-unit, since such units are usually more conceptually constrained.  A 
chapter is about one thing; an entire book is about many things.  Also, some 
document-level analyses can get out of hand with large documents, such as 
entity co-reference resolution if it's an O(n^2) algorithm.

If the goal is not document retrieval, but text mining for "facts" and so on, 
then the document boundary doesn't matter at all, and again merging isn't 
necessary.  The user just wants the information, and the document boundary 
isn't even visible.

In short, it's hard for me to imagine a use case where merging results from a 
huge document would even be desirable.

I also think that merging just delays the memory problem.  In many cases, 
annotations for parts of speech, named entities, etc. use several times the 
memory of the document itself.  So although this may be less memory than is 
needed while the annotators are running, you're still going to hit a 
document size that can't be handled.  And it may not be much larger than the 
document size you currently can't handle.

So, I think it's both desirable and necessary to split the document on natural 
boundaries as it streams into the process, and then just view each segment as a 
separate document.  These natural boundaries make sense to me; the 
arbitrarily-sized chunking, not so much.
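
For what it's worth, here's roughly what I mean -- an untested sketch, where the 
class name and the blank-line heuristic are stand-ins for whatever detects your 
real chapter or section boundaries:

    import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.AbstractCas;
    import org.apache.uima.jcas.JCas;

    // Sketch: split the incoming document on "natural" boundaries (blank
    // lines here, standing in for chapter/section breaks) and emit each
    // segment as its own CAS, to be treated as a separate document downstream.
    public class SegmentSplitter extends JCasMultiplier_ImplBase {

      private String[] segments;
      private int next;

      @Override
      public void process(JCas aJCas) throws AnalysisEngineProcessException {
        // Replace this split with whatever boundary detection fits your documents.
        segments = aJCas.getDocumentText().split("\n\\s*\n");
        next = 0;
      }

      @Override
      public boolean hasNext() throws AnalysisEngineProcessException {
        return next < segments.length;
      }

      @Override
      public AbstractCas next() throws AnalysisEngineProcessException {
        JCas segment = getEmptyJCas();
        segment.setDocumentText(segments[next++]);
        return segment;
      }
    }

(The component's descriptor would also need outputsNewCASes set to true.)  Each 
segment then flows through the rest of the CPE like an ordinary document.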


2. If you really need to merge the results, then I would look for a way to 
incrementally add the pieces to the repository, rather than try to get it all 
back together in memory.  For example, each segment could update the full-text 
index, or insert more records in a database, related to the same document ID.  
So the repository accumulates results on disk for the document, but the results 
are never all together in RAM.
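
To make the database variant concrete, here is roughly the shape of such a 
consumer -- an untested sketch, where the JDBC URL, the table layout, and 
especially the way the document ID and text offset are recovered from the 
chunking metadata in the CAS are all placeholders you would replace:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.FSIterator;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.collection.CasConsumer_ImplBase;
    import org.apache.uima.resource.ResourceInitializationException;
    import org.apache.uima.resource.ResourceProcessException;

    // Sketch: append each segment's annotations to a database table keyed by
    // the document ID, shifting offsets back into whole-document coordinates,
    // so the merged result only ever exists on disk, never in RAM.
    public class SegmentDbWriter extends CasConsumer_ImplBase {

      private Connection conn;

      @Override
      public void initialize() throws ResourceInitializationException {
        try {
          // Placeholder connection details -- point this at your repository.
          conn = DriverManager.getConnection(
              "jdbc:postgresql://localhost/annotations", "uima", "secret");
        } catch (Exception e) {
          throw new ResourceInitializationException(e);
        }
      }

      @Override
      public void processCas(CAS aCas) throws ResourceProcessException {
        // In a real consumer, docId and offset would come from the chunking
        // metadata your splitter stores in the CAS; hard-coded here because
        // that part of the type system is yours, not mine.
        String docId = "doc-0001";
        int offset = 0;

        try {
          PreparedStatement ps = conn.prepareStatement(
              "INSERT INTO annotation (doc_id, type, begin_pos, end_pos, covered_text) "
              + "VALUES (?, ?, ?, ?, ?)");
          FSIterator<AnnotationFS> it = aCas.getAnnotationIndex().iterator();
          while (it.hasNext()) {
            AnnotationFS a = it.next();
            ps.setString(1, docId);
            ps.setString(2, a.getType().getName());
            ps.setInt(3, a.getBegin() + offset);  // back to whole-document coordinates
            ps.setInt(4, a.getEnd() + offset);
            ps.setString(5, a.getCoveredText());
            ps.addBatch();
          }
          ps.executeBatch();
          ps.close();
        } catch (Exception e) {
          throw new ResourceProcessException(e);
        }
      }

      @Override
      public void destroy() {
        try { conn.close(); } catch (Exception ignored) { }
      }
    }

Whether you batch per segment or per document, nothing ever has to hold the 
whole merged document in memory; the repository does the accumulating.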

Alternatively, move to a 64-bit CPU/OS/JVM with many gigabytes of RAM 
installed, and process the document as usual (no chunking).  Buying that 
hardware might be less expensive than the labor involved in making chunking 
work.  You can buy a quad-core server with 8 GB RAM for $1000 (check out the 
Dell PowerEdge T105).  How much is your time worth?


Greg Holmberg


 -------------- Original message ----------------------
From: "Olivier Terrier" <[EMAIL PROTECTED]>
> Hi all,
> Sometimes we face the problem of processing collections of "big" documents.
> This can lead to instability in the processing chain: out-of-memory errors,
> timeouts, etc.
> Moreover, this is not very efficient in terms of load balancing (we use CPEs
> with analysis engines deployed as Vinci remote services on several machines).
> We would like to solve this problem by implementing a kind of UIMA document
> chunking, where big documents would be split into reasonable chunks (according
> to a given block size, for example) at the beginning of the processing chain
> and merged back into one CAS at the end.
> In our view, the splitting phase is quite straightforward: a CAS multiplier
> splits the input document into N text blocks and produces N CASes.
> Chunking information such as:
> - document identifier
> - current part number
> - total part number
> - text offset
> is stored in the CAS.
> The merging phase is much more complicated: a CAS consumer is responsible for
> intercepting each "part" and storing it somewhere (in memory or serialized on
> the filesystem); when the last part of the document comes in, all the
> annotations of the CAS parts are merged back, taking the offsets into account.
> As we use a CPE, the merger CAS consumer can't "produce" a new CAS. What we
> have in mind is to create a new Sofa "fullDocumentView" in the last CAS "part"
> to store the text of the full document along with its associated annotations.
> Another idea is to use sofa mappings so that our existing sofa-unaware CAS
> consumers, which come after the merger in the CPE flow, can stay unchanged.
>       CPE flow:
>       
>     CAS SPLITTER
> _InitialView: text part_i
> fullDocumentView: empty
>           |
>          AE1  
> _InitialView: text part_i + annotations AE1
> fullDocumentView: empty
>           |
>         ...
>           |
>          AEn
> _InitialView: text part_i + annotations AE1+...+AEn
> fullDocumentView: empty
>           |
>      CAS MERGER
> _InitialView: text part_i + annotations AE1+...+AEn
> fullDocumentView: if not last part = empty
>                   if last part = text + annotations merged part1+...+partN
>           |
>       CONSUMER (sofa-unaware)
> MAPPING cpe sofa : fullDocumentView => component sofa : _InitialView
> _InitialView: text + annotations merged part1+...+partN
> 
> The tricky operations are:
> - caching/storing the CAS 'parts' in the merger: how (XCAS, XMI, etc.)? where
> (memory, disk, ...)?
> - merging the annotations of the CAS 'parts' into the full-document CAS.
> - error management: what happens in case of errors on some parts?
> We would like to hear the thoughts/opinions of the UIMA community regarding
> this problem and possible solutions.
> Do you think our approach is the right one?
> Has anybody already faced a similar problem?
> As far as possible we don't want to reinvent the wheel, and we would give
> priority to a generic and ideally UIMA-builtin implementation. We are of
> course ready to contribute to this development if the community finds a
> generic solution.
> Regards
> Olivier Terrier - TEMIS 
> 
