On 3/2/11 11:14 AM, Andreas Kahl wrote:
> Mainly I am concerned with the latter:
> Those metadata records would come in as XML with dozens of fields containing
> relatively short texts (most less than 255 chars). We need to perform NLP
> (tokenization, stemming, ...) and some simpler manipulations like reading
> three fields and constructing a fourth from them.
> It would be very desirable to use one framework for both tasks (in fact we
> would use the pipeline to enrich the metadata records with the long texts).
You could take the XML, parse it, and then construct a short text which
contains the content together with annotations marking the existing
structure. This new text with its annotations is placed in a new view.
Afterwards you can perform your processing within these annotation bounds.
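To make the idea concrete, here is a minimal language-neutral sketch (in
Python, not using the UIMA API itself) of building one flat text plus
stand-off annotations from an XML record; the field names (title, author,
year) are invented for illustration. In UIMA you would instead set the
constructed text as the document text of a new CAS view and add Annotation
instances with the same offsets.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the field names are made up for illustration.
record = """<record>
  <title>Scalable NLP Pipelines</title>
  <author>A. Kahl</author>
  <year>2011</year>
</record>"""

def build_view(xml_string):
    """Parse the XML and build one flat text plus stand-off annotations
    (begin, end, label) that preserve the original field structure --
    the same idea as placing an annotated text into a new view."""
    root = ET.fromstring(xml_string)
    parts, annotations, offset = [], [], 0
    for field in root:
        text = (field.text or "").strip()
        annotations.append((offset, offset + len(text), field.tag))
        parts.append(text)
        offset += len(text) + 1  # +1 for the separating newline
    return "\n".join(parts), annotations

text, annotations = build_view(record)

# "Processing within annotation bounds": tokenize only the title field.
begin, end, _ = next(a for a in annotations if a[2] == "title")
print(text[begin:end].split())  # tokens taken from the title span only
```

Downstream components then never need the XML again; they operate on the
plain text and restrict themselves to whichever annotation spans they care
about.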
Not sure how you construct the 4th field, but if you can do that directly
after the XML parsing, it could be part of the constructed text.
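For example, deriving the 4th field right after parsing might look like the
following sketch (again plain Python; the field names and the formatting
rule are assumptions, since the original mail does not say how the 4th
field is built):

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; replace them with the real ones from your schema.
record = ET.fromstring(
    "<record><title>Scalable NLP</title>"
    "<author>A. Kahl</author><year>2011</year></record>"
)

# Read three fields and derive a fourth directly after parsing, so it can
# simply be appended to the constructed text before any NLP runs.
citation = "{} ({}): {}".format(
    record.findtext("author"),
    record.findtext("year"),
    record.findtext("title"),
)
print(citation)  # A. Kahl (2011): Scalable NLP
```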
With UIMA-AS you should be able to scale the analysis nicely to a few
machines.
Hope that helps,
Jörn