Sounds good. All the best! - Anuj
On Wed, Mar 2, 2011 at 7:55 PM, Andreas Kahl <[email protected]> wrote:
> Anuj and Jan,
>
> Thank you very much for your tips. I think I will try the annotation way:
> Use a CollectionProcessingEngine to iterate over all the docs in my input XML.
> Instantiate a CAS with the input XML as text.
> Then run an Annotator converting all XML tags into Annotations (I think I
> am going to set annotation.setBegin() and .setEnd() to something generic
> like 0).
> Based on that I'm going to build up my pipeline.
> I'll keep you posted as soon as I have some results.
>
> Best regards
> Andreas
>
>
> -------- Original Message --------
> > Date: Wed, 02 Mar 2011 11:46:06 +0100
> > From: "Jörn Kottmann" <[email protected]>
> > To: [email protected]
> > Subject: Re: How to process structured input with UIMA?
>
> > On 3/2/11 11:14 AM, Andreas Kahl wrote:
> > > Mainly I am concerned with the latter:
> > > Those metadata records would come in as XML with dozens of fields
> > > containing relatively short texts (most less than 255 chars). We need
> > > to perform NLP (tokenization, stemming ...) and some simpler
> > > manipulations like reading 3 fields and constructing a 4th from that.
> > > It would be very desirable to use one framework for both tasks (in
> > > fact we would use the pipeline to enrich the metadata records with
> > > the long texts).
> > >
> >
> > You could take the XML, parse it and then construct a short text which
> > contains the content together with annotations marking the existing
> > structure. This new text with the annotations would be placed in a new
> > view. Afterwards you can perform your processing within these
> > annotation bounds.
> >
> > Not sure how you construct the 4th field, but if you can do that
> > directly after the XML parsing, it could be part of the constructed
> > text.
> >
> > With UIMA-AS you should be able to nicely scale the analysis to a few
> > machines.
> >
> > Hope that helps,
> > Jörn
> >
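
For reference, here is a minimal sketch of the conversion step Andreas and Jörn describe: parse the XML record, concatenate the field contents into one document text, and mark each field's span with a generic Annotation so downstream annotators can work within those bounds. It is only an illustration under a few assumptions: uimaFIT's JCasFactory is used to build the CAS, the element name "field" is hypothetical, and a real pipeline would define a custom annotation type (carrying the field name as a feature) and could put the text into a dedicated view via jcas.createView(...).

// Sketch: build a CAS from one XML metadata record, marking each field's
// span with an Annotation. Assumes uimaFIT and a DOM parser are available;
// the "field" element name is a placeholder.
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XmlRecordToCas {

    public static JCas convert(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        JCas jcas = JCasFactory.createJCas();
        StringBuilder text = new StringBuilder();

        // Collect the spans first: setDocumentText must be called before
        // annotations are added to the indexes.
        NodeList fields = dom.getElementsByTagName("field");
        int[][] spans = new int[fields.getLength()][2];
        for (int i = 0; i < fields.getLength(); i++) {
            String content = fields.item(i).getTextContent();
            spans[i][0] = text.length();
            text.append(content);
            spans[i][1] = text.length();
            text.append('\n');
        }

        jcas.setDocumentText(text.toString());
        for (int[] span : spans) {
            // A generic Annotation marks each field; a custom type would let
            // later components know which field a span came from.
            new Annotation(jcas, span[0], span[1]).addToIndexes();
        }
        return jcas;
    }
}

Later components can then iterate jcas.getAnnotationIndex() and restrict tokenization, stemming, or the construction of the 4th field to each field's begin/end offsets instead of the whole document text.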
