Re: SourceDocumentInformation

Armin.Wegner Sun, 27 Nov 2011 22:30:56 -0800

Hi Richard,

Thank you. I did it the hard way using CAS. JCas works fine as well. In both 
cases, SourceDocumentInformation.xml has to be included as a type system.


For the latter case, I derived a JCas from a CAS with getJCas(), local to the 
process method as in 
org.apache.uima.examples.cpe.FileSystemCollectionReader.java, and used the 
SourceDocumentInformation class to fill in that annotation's attributes. What 
happens if I make a JCas from a CAS that way? Is it just another frontend for 
the same data, or is the whole CAS data copied/duplicated to a new JCas 
instance? 

Regards,

Armin

-----Original Message-----
From: Richard Eckart de Castilho [mailto:[email protected]] 
Sent: Wednesday, 23 November 2011 10:00
To: [email protected]
Subject: Re: SourceDocumentInformation

Hello Armin,

UIMA does not provide this piece of information in the CAS by default. You can 
use the SourceDocumentInformation type, and you can also use it with CAS if you 
want, but you will have to access it the more complicated way through the 
generic CAS API, e.g. something like this:

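  import org.apache.uima.cas.Type;
  import org.apache.uima.cas.text.AnnotationFS;

  // Look up the type; it must be part of the CAS type system.
  Type type = cas.getTypeSystem().getType(
      "org.apache.uima.examples.SourceDocumentInformation");
  // Create an annotation spanning the whole document text.
  AnnotationFS anno = cas.createAnnotation(type, 0,
      cas.getDocumentText().length());
  anno.setStringValue(type.getFeatureByBaseName("uri"),
      "file:/path/to/file.txt");
  cas.addFsToIndexes(anno);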
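
Once the URI is stored in the annotation, a writer at the end of the pipeline 
can read the "uri" feature back (for example with 
anno.getStringValue(type.getFeatureByBaseName("uri"))) and derive an output 
file name from it. The following is only a minimal sketch of that last step in 
plain Java: the class SourceNames, both method names and the ".xmi" suffix are 
assumptions made for illustration and are not part of UIMA.

```java
import java.net.URI;

public class SourceNames {

    /** Returns the last path segment of a document URI, i.e. the file name. */
    public static String sourceFileName(String documentUri) {
        String path = URI.create(documentUri).getPath();
        if (path == null || path.isEmpty()) {
            throw new IllegalArgumentException("URI has no path: " + documentUri);
        }
        // Everything after the last '/' is treated as the file name.
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Hypothetical naming scheme: source file name plus ".xmi". */
    public static String outputFileName(String documentUri) {
        return sourceFileName(documentUri) + ".xmi";
    }

    public static void main(String[] args) {
        System.out.println(outputFileName("file:/path/to/file.txt"));
    }
}
```

Working on the URI path instead of converting to a java.io.File keeps the 
helper independent of the platform's file separator.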

In DKPro Core we define a DocumentMetaData type which is our replacement for 
SourceDocumentInformation and used by our readers and writers. It provides the 
fields:

  documentTitle
  documentBaseUri
  documentUri
  collectionId
  documentId
  isLastSegment

We currently do not have the fields "offsetInSource" and "documentSize". I 
think I should add these.

Anyway, you can define your own metadata annotation type inheriting from 
DocumentAnnotation and use that. You should add it to the CAS before setting 
any language or text though, because otherwise UIMA will automatically create a 
default DocumentAnnotation in the CAS and you will end up with two metadata 
annotations. If you add yours first, UIMA will use it and it will be accessible 
via CAS.getDocumentAnnotation() as well.

Best,

-- Richard

On 23.11.2011 at 09:16, [email protected] wrote:

> Hi!
> 
> I need to know the names of the source documents when writing the 
> resulting CASes from a pipeline which starts by reading source 
> documents with a collection reader. I thought that 
> org.apache.uima.examples.SourceDocumentInformation is the correct 
> means to do it. But it is just an example and it works with JCas only. 
> Is there no SourceDocumentInformation for CAS? Is this really the way 
> to do it or are there other means as well? Is it my responsibility to 
> fill in the values in a collection reader or is it done automatically?
> 
> Regards,
> 
> Armin

--
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-7477, fax -5455, room S2/02/B117
[email protected]
www.ukp.tu-darmstadt.de
Web Research at TU Darmstadt (WeRC) www.werc.tu-darmstadt.de
-------------------------------------------------------------------
