Hi Greg
Thanks for your thoughts, some comments below:

> 1. What's the motivation for merging?  For example, if one is 
> going to put the data into a system whose purpose is 
> retrieving documents (index into a full-text index or insert 
> into a database), then the user may not even want the entire 
> document back as a result.
You may be right, and we are also considering this option. It will
have an impact on the way we store the documents and their
annotations in our index/database, but it may well be more efficient
and less error-prone to do it that way.
Storing big documents as one "unit of retrieval" in our database is
probably not a good idea, because you never present/display the
entire text to the end user: you always have a section-by-section or
page-by-page mechanism for retrieving the document.
So having the hierarchical structure of documents reflected in the
database schema is probably a good idea.
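
For what it's worth, here is a minimal sketch of the kind of
section-level storage we have in mind (the table/column names and the
in-memory H2 database are purely illustrative, not our actual schema):

// Illustrative only: store and retrieve documents section by section
// instead of as one big "unit of retrieval".
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SectionStore {
    public static void main(String[] args) throws Exception {
        // hypothetical in-memory database, just for the example
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:docs");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE section ("
                     + " doc_id VARCHAR, section_no INT, section_text CLOB,"
                     + " PRIMARY KEY (doc_id, section_no))");
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO section VALUES (?, ?, ?)")) {
                ps.setString(1, "DOC-42");            // document identifier (invented)
                ps.setInt(2, 1);                      // first section of the document
                ps.setString(3, "Text of section 1"); // one section = one retrievable unit
                ps.executeUpdate();
            }
        }
    }
}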

> In short, it's hard for me to imagine a use case where 
> merging results from a huge document would even be desirable.
I can give you several scenarios where working with the entire
document makes sense.
Most of our annotators are dedicated to named-entity or relation
extraction and, as you say, they generally work well on a
sub-document text unit (a sentence or a paragraph).
But we also have components that work on the entire document or on the
entire set of annotations that have been extracted by previous
annotators.
For example, we do automatic document categorization using a Naive Bayes
model based on the frequencies of lemmas of all the
nouns/verbs/adjectives present in the text of the document.
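
To make that concrete, here is a very rough sketch of the idea (not
our actual component; the class/method names are invented and the
model parameters are assumed to have been learned offline). The point
is simply that the lemma counts and the scoring are computed over the
whole document:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveBayesCategorizer {
    // log P(category) and log P(lemma | category), assumed given here
    private final Map<String, Double> logPrior;
    private final Map<String, Map<String, Double>> logLikelihood;

    public NaiveBayesCategorizer(Map<String, Double> logPrior,
                                 Map<String, Map<String, Double>> logLikelihood) {
        this.logPrior = logPrior;
        this.logLikelihood = logLikelihood;
    }

    /** Lemmas of the nouns/verbs/adjectives of the WHOLE document, not of one chunk. */
    public String categorize(List<String> documentLemmas) {
        Map<String, Integer> counts = new HashMap<>();
        for (String lemma : documentLemmas) {
            counts.merge(lemma, 1, Integer::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> cat : logPrior.entrySet()) {
            double score = cat.getValue();
            Map<String, Double> lik = logLikelihood.get(cat.getKey());
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                // unseen lemmas get a small smoothed log-probability
                score += e.getValue() * lik.getOrDefault(e.getKey(), -10.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = cat.getKey();
            }
        }
        return best;
    }
}
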
Another example is knowledge propagation inside a document: suppose
you have found a named entity of type Person, 'Greg Holmberg', at the
very beginning of a document and another instance, 'G. Holmberg', at
the end. There is a reasonable chance that the two refer to the same
person, and you can safely propagate the first name to the second
instance.
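
As an illustration (assuming a hypothetical JCas-generated Person
type with lastName/firstName features, which is not exactly our real
type system), the propagation is essentially two passes over the
document-wide annotation index:

// 'Person' is a hypothetical JCas cover class with
// getLastName()/getFirstName()/setFirstName().
import java.util.HashMap;
import java.util.Map;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class PersonNamePropagator extends JCasAnnotator_ImplBase {

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        // First pass: remember the first name seen for each last name
        // anywhere in the document.
        Map<String, String> firstNameByLastName = new HashMap<>();
        FSIterator<Annotation> it = jcas.getAnnotationIndex(Person.type).iterator();
        while (it.hasNext()) {
            Person p = (Person) it.next();
            if (p.getLastName() != null && p.getFirstName() != null) {
                firstNameByLastName.putIfAbsent(p.getLastName(), p.getFirstName());
            }
        }
        // Second pass: fill in missing first names
        // ('G. Holmberg' inherits 'Greg').
        it = jcas.getAnnotationIndex(Person.type).iterator();
        while (it.hasNext()) {
            Person p = (Person) it.next();
            if (p.getFirstName() == null
                    && firstNameByLastName.containsKey(p.getLastName())) {
                p.setFirstName(firstNameByLastName.get(p.getLastName()));
            }
        }
    }
}
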
Another example would be an automatic summarizer that gives you the
10 most relevant sentences of a document.
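
Again, purely as an illustration of why this is a whole-document
operation (our real summarizer is more elaborate), a naive
frequency-based version could look like this:

// Score each sentence by the document-level frequency of its words
// and keep the top N; both the counts and the ranking are global.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencySummarizer {

    public static List<String> topSentences(List<String> sentences, int n) {
        // document-level word frequencies
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String w : s.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    freq.merge(w, 1, Integer::sum);
                }
            }
        }
        // rank sentences by the sum of their word frequencies
        List<String> ranked = new ArrayList<>(sentences);
        ranked.sort(Comparator.comparingInt((String s) -> {
            int score = 0;
            for (String w : s.toLowerCase().split("\\W+")) {
                score += freq.getOrDefault(w, 0);
            }
            return score;
        }).reversed());
        return ranked.subList(0, Math.min(n, ranked.size()));
    }
}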

> So, I think it's both desirable and necessary to split the 
> document on natural boundaries as it streams into the 
> process, and then just view each segment as a separate 
> document.  These natural boundaries make sense to me, the 
> arbitrarily-sized chunking not so much.
You're right, and we already do that: the block size is just an
indication, but the actual cut is made on a natural boundary
(generally a sentence boundary).
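
For illustration, something in the spirit of the following (a
simplified sketch based on java.text.BreakIterator; our splitter uses
its own sentence detection):

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class NaturalBoundarySplitter {

    /** The block size is only an indication; every cut falls on a sentence end. */
    public static List<String> split(String text, int targetBlockSize) {
        List<String> blocks = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance();
        sentences.setText(text);
        int blockStart = 0;
        for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
            // close the current block at this sentence end once the
            // indicative size has been reached
            if (end - blockStart >= targetBlockSize) {
                blocks.add(text.substring(blockStart, end));
                blockStart = end;
            }
        }
        if (blockStart < text.length()) {
            blocks.add(text.substring(blockStart)); // whatever remains after the last cut
        }
        return blocks;
    }
}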

> 2. If you really need to merge the results, then I would look 
> for a way to incrementally add the pieces to the repository, 
> rather than try to get it all back together in memory.  For 
> example, each segment could update the full-text index, or 
> insert more records in a database, related to the same 
> document ID.  So the repository accumulates results on disk 
> for the document, but the results are never all together in RAM.
Yes, this is definitely an interesting option.
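
If we understand correctly, each chunk CAS would be written straight
to the repository with its offsets shifted back into whole-document
coordinates, so nothing ever has to be merged in RAM. A rough sketch
(the table/column names and the Annot class are invented for the
example):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class IncrementalAnnotationWriter {

    /** Minimal stand-in for an annotation extracted from one chunk CAS. */
    public static class Annot {
        final String type;
        final int begin;
        final int end;
        public Annot(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private final Connection connection;

    public IncrementalAnnotationWriter(Connection connection) {
        this.connection = connection;
    }

    /** Called once per chunk; docId and chunkOffset come from the chunking metadata. */
    public void writeChunk(String docId, int chunkOffset, List<Annot> annotations)
            throws SQLException {
        String sql = "INSERT INTO annotation (doc_id, type, begin_pos, end_pos)"
                   + " VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            for (Annot a : annotations) {
                ps.setString(1, docId);
                ps.setString(2, a.type);
                // shift chunk-local offsets back into whole-document coordinates
                ps.setInt(3, chunkOffset + a.begin);
                ps.setInt(4, chunkOffset + a.end);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}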

> Alternatively, move to a 64-bit CPU/OS/JVM with many 
> gigabytes of RAM installed, and process the document as usual 
> (no chunking).  Buying that hardware might be less expensive 
> than the labor involved in making chunking work.  You can buy 
> a quad-core server with 8 GB RAM for $1000 (check out the 
> Dell PowerEdge T105).
> How much is your time worth?
As you can imagine: a lot! That's why we are asking for the
community's opinion before deciding whether or not to start this
work.

> 
> 
> Greg Holmberg
> 
> 
>  -------------- Original message ----------------------
> From: "Olivier Terrier" <[EMAIL PROTECTED]>
> > Hi all,
> > Sometimes we are facing the problem of processing collections
> > of "big" documents.
> > This may lead to instability of the processing chain:
> > out-of-memory errors, timeouts, etc.
> > Moreover, this is not very efficient in terms of load balancing
> > (we use CPEs with analysis engines deployed as Vinci remote
> > services on several machines).
> > We would like to solve this problem by implementing a kind of
> > UIMA document chunking where big documents would be split into
> > reasonable chunks (according to a given block size, for example)
> > at the beginning of the processing chain and merged back into
> > one CAS at the end.
> > In our view, the splitting phase is quite straightforward: a CAS
> > multiplier splits the input document into N text blocks and
> > produces N CASes.
> > Chunking information like:
> > - document identifier
> > - current part number
> > - total part number
> > - text offset
> > is stored in the CAS.
> > The merging phase is much more complicated: a CAS consumer is
> > responsible for intercepting each "part" and storing it somewhere
> > (in memory or serialized on the filesystem); when the last part
> > of the document comes in, all the annotations of the CAS parts
> > are merged back, taking the offsets into account.
> > As we use a CPE, the merger CAS consumer can't "produce" a new CAS. 
> > What we have in mind is to create a new Sofa "fullDocumentView"
> > in the last CAS "part" to store the text of the full document
> > along with its associated annotations.
> > Another idea is to use sofa mappings to leave unchanged our
> > existing CAS consumers (that are sofa-unaware) that come after
> > the merger in the CPE flow.
> >       CPE flow:
> >       
> >     CAS SPLITTER
> > _InitialView: text part_i
> > fullDocumentView: empty
> >           |
> >          AE1
> > _InitialView: text part_i + annotations AE1
> > fullDocumentView: empty
> >           |
> >         ...
> >           |
> >          AEn
> > _InitialView: text part_i + annotations AE1+...+AEn
> > fullDocumentView: empty
> >           |
> >      CAS MERGER
> > _InitialView: text part_i + annotations AE1+...+AEn
> > fullDocumentView: if not last part = empty
> >                   if last part = text + annotations merged part1+...+partN
> >           |
> >       CONSUMER (sofa-unaware)
> > MAPPING cpe sofa : fullDocumentView => component sofa : _InitialView
> > _InitialView: text + annotations merged part1+...+partN
> > 
> > The tricky operations are:
> > - caching/storing the CAS 'parts' in the merger: how (XCAS,
> >   XMI, etc.)? where (memory, disk, ...)?
> > - merging the CAS 'parts' annotations into the full-document CAS.
> > - error management: what happens in case of errors on some parts?
> > We would like to share the thoughts/opinions of the UIMA community 
> > regarding this problem and the possible solutions.
> > Do you think our approach is the right one?
> > Has anybody already faced a similar problem?
> > As far as possible we don't want to reinvent the wheel, and we
> > give priority to a generic and ideally UIMA-built-in
> > implementation. We are of course ready to contribute to this
> > development if the community finds a generic solution.
> > Regards
> > Olivier Terrier - TEMIS
> > 