Hi, I am wondering if anyone has a better idea.

Requirements:
a. I have a pipeline that needs to process a set of XML files.
b. The XML files may be on local disk or at a remote location (available via an HTTP GET call, e.g. http://example.com/inputFiles/001.xml).
c. Each XML file contains multiple sections, and each section's content should be parsed to produce a separate CAS.
d. I need to be able to parse XML of different schemas. However, each pipeline run only has to handle one specific XML schema; I do not need to handle different schemas within a single run.
e. Given the above, I need to be able to construct a new collection reader and parser based on the specific needs of each application.
f. For example, in one pipeline I can specify that the XML files are in a disk folder and that parser A should decode their schema. In another pipeline, I can give the collection reader a list of URLs to retrieve remote XML files and parse them using parser B.
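To make requirements b and c concrete, here is a minimal sketch that uses only the JDK, not UIMA. It assumes a hypothetical <section> element marks each section; the real element name depends on the schema. SectionSplitter, open and splitSections are illustrative names, not part of cleartk or UIMA. Each returned string stands for the text that should end up in its own CAS.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SectionSplitter {

    /** Opens an absolute file: or http(s): URI as a stream (requirement b). */
    public static InputStream open(URI uri) throws Exception {
        // URL.openStream() handles both local file URIs and HTTP GET.
        return uri.toURL().openStream();
    }

    /**
     * Returns the trimmed text of every element named sectionTag,
     * one list entry per section (requirement c).
     * Assumption: sections are not nested inside each other.
     */
    public static List<String> splitSections(InputStream xml, String sectionTag)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xml);
        NodeList nodes = doc.getElementsByTagName(sectionTag);
        List<String> sections = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            sections.add(nodes.item(i).getTextContent().trim());
        }
        return sections;
    }

    public static void main(String[] args) throws Exception {
        // Inline sample input; a real run would call open(uri) instead.
        String xml = "<doc><section>First part.</section>"
                + "<section>Second part.</section></doc>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String text : splitSections(in, "section")) {
            System.out.println(text); // each entry would become one CAS
        }
    }
}
```

A schema-specific parser (parser A, parser B) would replace splitSections, while open stays the same for disk and remote inputs.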
Here is what I have so far:
a. I am using cleartk's UriCollectionReader to insert the URIs of files into the CAS, from both local disk folders and remote URIs. So far so good.
b. I created an AE, UriToDocumentAnnotatorA, that reads the URI in the CAS and parses the file according to XML schema A.
c. But the above only produces one CAS per XML file, so requirement c is not fulfilled. I need to produce multiple CASes from a single XML file.

How do I do this? Thanks in advance.
