Hello Oli,

I know of two strategies:

1) READER+AE: use a reader to control where the data is retrieved from. The 
reader only reads the raw data format, e.g. a PDF file, into the CAS. A 
subsequent analysis engine then converts the raw data into what is actually to 
be processed, e.g. extracting the plain text from the PDF. I believe that 
ClearTK [1] is moving in this direction nowadays.
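
For illustration, a minimal uimaFIT sketch of that split. The component 
classes RawPdfReader and PdfTextExtractor and the parameter name are 
hypothetical placeholders (not actual ClearTK classes); only the uimaFIT 
factory/pipeline calls are real API:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.collection.CollectionReaderDescription;

public class ReaderPlusAePipeline {
  public static void main(String[] args) throws Exception {
    // The reader only fetches the raw data (here: PDF files) and puts it
    // into the CAS - it knows nothing about the PDF format internals.
    CollectionReaderDescription reader = createReaderDescription(
        RawPdfReader.class, // hypothetical reader component
        RawPdfReader.PARAM_SOURCE_LOCATION, "data/*.pdf");

    // A subsequent analysis engine turns the raw data into the plain
    // text that downstream components actually process.
    AnalysisEngineDescription extractor = createEngineDescription(
        PdfTextExtractor.class); // hypothetical conversion component

    runPipeline(reader, extractor);
  }
}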

2) READER+PLUGIN: use a reader to perform the data conversion. The reader may 
be configured with a strategy that controls where the data is obtained from. 
DKPro Core [2] goes in that direction. Most of its readers can be configured 
with a custom Spring ResourcePatternResolver, e.g. to access files from HDFS 
(afaik a corresponding ResourcePatternResolver is included in Spring for Apache 
Hadoop [3]). I also did a proof-of-concept ResourcePatternResolver for Samba 
shares once.
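
To illustrate the idea (this is just a sketch, not DKPro Core's actual reader 
code): the reader enumerates its input through a Spring 
ResourcePatternResolver, so swapping in a different resolver implementation 
changes where the data comes from without touching the format handling:

import java.io.InputStream;

import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.io.support.ResourcePatternResolver;

public class ResolverBasedAccess {
  public static void main(String[] args) throws Exception {
    // The default resolver handles file:, classpath:, jar:, etc. A custom
    // implementation (e.g. for HDFS or Samba) could be dropped in here
    // without changing the reading logic below.
    ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();

    for (Resource resource : resolver.getResources("file:data/**/*.pdf")) {
      try (InputStream is = resource.getInputStream()) {
        // Hand the stream over to the format-specific conversion here.
        System.out.println("Found: " + resource.getFilename());
      }
    }
  }
}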

I guess it boils down to whether you consider it important to have the raw data 
in the CAS. Some people may see that as a benefit, while others may consider it 
a waste of memory.

In the olden times, there was a thing called CasInitializer [4], which appears 
to have been a plugin that a reader could use to extract information from the 
raw data and fill it into the CAS - that sounds like approach 2) mentioned 
above. However, the CasInitializer has been deprecated for quite some time now, 
and its Javadoc says to use multiple CAS views instead (which sounds like 
approach 1). Maybe somebody else can provide some detail as to why the 
CasInitializer was deprecated - I never used it, but I always thought it 
sounded like quite a useful concept.
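
For what it's worth, the views-based alternative that the Javadoc hints at 
could look roughly like this - the view names are arbitrary, and binary raw 
data would go in via setSofaDataArray instead:

import org.apache.uima.jcas.JCas;

public class ViewBasedSplit {
  // Called by the reader: store the raw document in its own view.
  static void storeRaw(JCas jcas, String rawContent) throws Exception {
    JCas rawView = jcas.createView("raw"); // arbitrary view name
    rawView.setSofaDataString(rawContent, "application/pdf");
  }

  // Called by a downstream analysis engine: put the extracted plain
  // text into a second view for the rest of the pipeline to process.
  static void storeText(JCas jcas, String extractedText) throws Exception {
    JCas textView = jcas.createView("text"); // arbitrary view name
    textView.setDocumentText(extractedText);
  }
}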

Cheers,

-- Richard

[1] http://cleartk.googlecode.com
[2] https://code.google.com/p/dkpro-core-asl/
[3] http://projects.spring.io/spring-hadoop/
[4] http://uima.apache.org/downloads/releaseDocs/2.3.0-incubating/docs/api/org/apache/uima/collection/CasInitializer.html

P.S.: none of the projects I mentioned ([1]-[3]) are ASF projects. I am 
affiliated with the DKPro Core project.

On 29.05.2014, at 15:11, Oliver Christ <[email protected]> wrote:

> Hi,
> 
> From my (still very limited) UIMA experience, it seems that collection readers 
> address both how to retrieve documents from some location and how to import 
> (or filter) those documents into the CAS.
> 
> Filtering (i.e. file format-specific processing) can be seen as independent 
> of where the data is retrieved from. I'm wondering whether there's a "UIMA 
> way" to separate the two aspects, i.e. a model consisting of two components; 
> one which abstracts storage and retrieval, and the second addressing file 
> format filtering.
> 
> Thanks!
> 
> Cheers, Oli
