Anuj and Jan,

Thank you very much for your tips. I think I will try the annotation way: use a CollectionProcessingEngine to iterate over all the docs in my input XML, instantiate a CAS with the input XML as text, and then run an annotator that converts all XML tags into annotations (I think I am going to set annotation.setBegin() and .setEnd() to something generic like 0). Based on that I'm going to build up my pipeline. I'll keep you posted as soon as I have some results.
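The tag-to-annotation step above can be sketched without any UIMA dependency: a SAX handler strips the XML tags, keeps the element content as plain text, and records a real begin/end offset for each element instead of a generic 0 (the class and method names here are illustrative, not UIMA API; in an actual annotator these spans would become Annotation instances on the CAS):

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: flatten XML into plain text while recording begin/end
 *  offsets per element, so each element can later be turned into
 *  an annotation over the CAS document text. */
public class XmlToOffsets {

    /** One recorded span: element name plus [begin, end) into the text. */
    public static final class Span {
        public final String tag;
        public final int begin;
        public final int end;
        Span(String tag, int begin, int end) {
            this.tag = tag;
            this.begin = begin;
            this.end = end;
        }
    }

    public static final class Result {
        public final String text;
        public final List<Span> spans;
        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    public static Result extract(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<Integer> begins = new ArrayDeque<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Remember where this element's content starts in the flat text.
                begins.push(text.length());
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                // Element closed: its span covers everything appended since it opened.
                spans.add(new Span(qName, begins.pop(), text.length()));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) throws Exception {
        Result r = extract("<record><title>UIMA</title><author>Andreas</author></record>");
        System.out.println(r.text); // UIMAAndreas
        for (Span s : r.spans)
            System.out.println(s.tag + " [" + s.begin + "," + s.end + ")");
    }
}
```

With real offsets, downstream annotators (tokenizer, stemmer) can be restricted to the bounds of a single field rather than running over tag soup.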
Best Regards
Andreas

-------- Original Message --------
> Date: Wed, 02 Mar 2011 11:46:06 +0100
> From: "Jörn Kottmann" <[email protected]>
> To: [email protected]
> Subject: Re: How to process structured input with UIMA?
>
> On 3/2/11 11:14 AM, Andreas Kahl wrote:
> > Mainly I am concerned with the latter:
> > Those metadata records would come in as XML with dozens of fields
> > containing relatively short texts (most less than 255 chars). We need
> > to perform NLP (tokenization, stemming ...) and some simpler
> > manipulations like reading 3 fields and constructing a 4th from that.
> > It would be very desirable to use one framework for both tasks (in
> > fact we would use the pipeline to enrich the metadata records with
> > the long texts).
>
> You could take the XML, parse it, and then construct a short text which
> contains the content together with annotations to mark the existing
> structure. This new text with the annotations will be placed in a new
> view. Afterwards you can perform your processing within these
> annotation bounds.
>
> Not sure how you construct the 4th field, but if you can do that
> directly after the XML parsing, it could be part of the constructed
> text.
>
> With UIMA-AS you should be able to nicely scale the analysis to a few
> machines.
>
> Hope that helps,
> Jörn
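Jörn's point about building the 4th field "directly after the XML parsing" can be sketched as a plain post-parse step on the record's field map, before the text enters the pipeline (field names and the combination rule below are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: derive a fourth metadata field from three parsed fields,
 *  right after XML parsing and before pipeline processing.
 *  The field names and the citation format are purely illustrative. */
public class DerivedField {

    /** Combine three existing fields into one new value. */
    static String buildCitation(Map<String, String> record) {
        return record.get("author") + " (" + record.get("year") + "): "
             + record.get("title");
    }

    public static void main(String[] args) {
        Map<String, String> rec = new LinkedHashMap<>();
        rec.put("author", "Kahl");
        rec.put("year", "2011");
        rec.put("title", "Structured input with UIMA");
        // The constructed 4th field joins the record like any parsed field,
        // so it can also become part of the constructed text and get its
        // own annotation span.
        rec.put("citation", buildCitation(rec));
        System.out.println(rec.get("citation")); // Kahl (2011): Structured input with UIMA
    }
}
```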
