Hello everyone, 

I am currently evaluating UIMA as a possible unified document processing 
framework for our data. On the one hand, we need to process large, unstructured 
texts (language detection, linguistic normalization, entity extraction, etc.); 
on the other hand, we have millions of structured metadata records to process. 

My main concern is the latter: 
These metadata records would come in as XML with dozens of fields containing 
relatively short texts (most under 255 characters). We need to perform NLP on 
them (tokenization, stemming, ...) as well as some simpler manipulations, such 
as reading three fields and constructing a fourth from them. 
It would be very desirable to use one framework for both tasks (in fact we 
would use the pipeline to enrich the metadata records with the long texts). 

Reading the documentation, I can imagine three different ways to process 
structured (XML) documents:
1. Find some way to put multiple text fields into one CAS, so that a CAS 
Processor (Analysis Engine) can access and manipulate several of those fields 
at a time - not via cas.setDocumentText(), which as I understand it would imply 
having only one input field. Is there an interface in the Collection Processing 
Engine to map XML fields to CAS fields? (See sketch 1 below.)
2. Or am I better off using multiple CAS views (sketch 2)? The XML fields are 
not different representations of the same content, though; they contain 
disjoint categories such as author or title.
3. Is there perhaps some smart way to generate multiple sub-CASes, each 
containing one field (sketch 3)? 
In cases 2 and 3 I am unsure whether my Analysis Engines would still be able to 
access multiple fields at once. 
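
To make option 1 more concrete, here is a rough, untested sketch of what I 
have in mind: concatenate the fields into one document text and mark each 
field's region with an annotation carrying the field name. Note that the type 
FieldAnnotation (a JCasGen-generated annotation with a String feature 
fieldName) is my own invention, declared in our own type system descriptor, 
not an existing UIMA type:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.uima.jcas.JCas;

    public class FieldMapper {
        /** Concatenate the XML fields into one document text and mark each
         *  field's region with a (hypothetical) FieldAnnotation carrying the
         *  field name, so downstream AEs can see several fields at once. */
        public static void addFields(JCas jcas, Map<String, String> fields) {
            StringBuilder text = new StringBuilder();
            Map<String, int[]> spans = new LinkedHashMap<String, int[]>();
            for (Map.Entry<String, String> e : fields.entrySet()) {
                int begin = text.length();
                text.append(e.getValue());
                spans.put(e.getKey(), new int[] { begin, text.length() });
                text.append('\n'); // separator between fields
            }
            jcas.setDocumentText(text.toString());
            for (Map.Entry<String, int[]> e : spans.entrySet()) {
                FieldAnnotation fa =
                    new FieldAnnotation(jcas, e.getValue()[0], e.getValue()[1]);
                fa.setFieldName(e.getKey()); // hypothetical feature
                fa.addToIndexes();
            }
        }
    }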
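
Sketch 2 would use one view per field, created with CAS.createView(). If I 
read the documentation correctly, a sofa-aware Analysis Engine could then 
fetch several views by name via cas.getView("author") etc., which might 
already answer my concern about accessing multiple fields at once:

    import java.util.Map;
    import org.apache.uima.cas.CAS;

    public class FieldViewMapper {
        /** Create one view per XML field; the view name is the field name. */
        public static void addFieldViews(CAS cas, Map<String, String> fields) {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                CAS view = cas.createView(e.getKey()); // e.g. "author"
                view.setDocumentText(e.getValue());
            }
        }
    }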
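
For option 3, splitting a record into sub-CASes looks to me like a job for a 
CAS Multiplier (JCasMultiplier_ImplBase), with <outputsNewCASes>true
</outputsNewCASes> set in the descriptor. In this untested sketch, 
extractFields() is a hypothetical helper that would pull the field name/value 
pairs out of the input CAS (e.g. from annotations as in sketch 1):

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.AbstractCas;
    import org.apache.uima.jcas.JCas;

    public class FieldSplitter extends JCasMultiplier_ImplBase {
        private Iterator<Map.Entry<String, String>> fields;

        /** Remember the fields of the incoming record; next() then emits
         *  one sub-CAS per field. */
        public void process(JCas aJCas) throws AnalysisEngineProcessException {
            fields = extractFields(aJCas).entrySet().iterator();
        }

        public boolean hasNext() throws AnalysisEngineProcessException {
            return fields.hasNext();
        }

        public AbstractCas next() throws AnalysisEngineProcessException {
            Map.Entry<String, String> e = fields.next();
            JCas sub = getEmptyJCas();
            sub.setDocumentText(e.getValue()); // one field per sub-CAS
            return sub;
        }

        /** Hypothetical helper: read the field name/value pairs from the
         *  input CAS. */
        private Map<String, String> extractFields(JCas aJCas) {
            return new LinkedHashMap<String, String>(); // placeholder
        }
    }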

Which of these are feasible at all, and which would you recommend? 

Thanks for any hints or experiences. 

Andreas
