I just want to point out that there is an alternative. I never use
collection readers and CAS consumers myself. Instead, I do the reading
of the input and the aggregation of the output outside the framework,
where I have more control over things. Just my opinion though.
See
http://uima.apache.org/d/uimaj-2.4.2/tutorials_and_users_guides.html#ugr.tug.application.using_aes
on how to do that.
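
A minimal sketch of the idea, for illustration only: application code reads the XML itself, splits it into sections, and hands each section's text to a callback. It assumes the sections are elements with a known tag name (the tag "section" and the class name SectionSplitter are made up here, since the real schema is not given). In a real application the callback would typically reset a CAS obtained from ae.newCAS(), call cas.setDocumentText(text), and then ae.process(cas), so that each section ends up in its own CAS.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads one XML document outside the framework and
// yields one text chunk per section element.
public class SectionSplitter {

    /** Returns the trimmed text of every element named sectionTag, in document order. */
    public static List<String> splitSections(InputStream xml, String sectionTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList nodes = doc.getElementsByTagName(sectionTag);
        List<String> sections = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            sections.add(nodes.item(i).getTextContent().trim());
        }
        return sections;
    }

    /**
     * Calls perSection once per section and returns the number of sections.
     * With UIMA, perSection is where a CAS would be filled and passed to
     * the analysis engine (assumption: one CAS per section).
     */
    public static int processAll(InputStream xml, String sectionTag,
                                 Consumer<String> perSection) throws Exception {
        List<String> sections = splitSections(xml, sectionTag);
        for (String text : sections) {
            perSection.accept(text);
        }
        return sections.size();
    }

    public static void main(String[] args) throws Exception {
        // Inline sample input; a local file or a remote URL would work the
        // same way, e.g. via new java.net.URL(location).openStream().
        String xml = "<doc><section>First part.</section><section>Second part.</section></doc>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        processAll(in, "section", text -> System.out.println("would process: " + text));
    }
}
```

Supporting a different schema then only means swapping the splitting logic, and the choice between disk and HTTP input is just a matter of which InputStream gets opened.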
--Thilo
On 10/07/2013 03:19 AM, swirl wrote:
Hi,
I am wondering if anyone has a better idea.
Requirement:
a. I have a pipeline that needs to process a bunch of XML files.
b. The XML files could be on disk or at a remote location (available
via an HTTP GET call, e.g. http://example.com/inputFiles/001.xml).
c. Each XML file contains multiple sections; each section's content should be
parsed to produce a separate CAS.
d. I need to be able to parse XML files of different schemas, although the
assumption is that each pipeline run handles only one specific XML schema.
That is, I do not need to handle different XML schemas within a single run.
e. Given the above, I need to be able to construct a new collection reader
and parser based on the specific needs of each application.
f. For example, in one pipeline I can specify that the XML files are in a
disk folder and use parser A to decode their specific schema. In another
pipeline, I can give the collection reader a list of URLs from which to
retrieve remote XML files and parse them using parser B.
Here is what I have so far:
a. I am using ClearTK's UriCollectionReader to insert URIs of files into the
CAS from local disk folders and remote URIs. So far so good.
b. I created an AE, UriToDocumentAnnotatorA, that reads the URI in the CAS
and parses the file according to XML schema A.
c. But the above only produces one CAS per XML file, so requirement c is not
fulfilled. I need to produce multiple CASes from a single XML file. How do I
do this?
Thanks in advance.