Julien Nioche <[EMAIL PROTECTED]> writes:
>
> Hi guys,
>
> Did anyone give https://issues.apache.org/jira/browse/UIMA-1095 a try? Any
> thoughts on it?
>
> Best,
>
> J.
>

Hi Julien,

Thanks for that contribution. I think that kind of functionality is important for UIMA. I gave it a first try. So far I have only used it and have not yet looked seriously at the code. Here is some initial, unsorted user feedback:

- Having a binary Tika jar would speed things up (I needed help to get it built).
- It worked fine for me once I got the jar.
- In my initial trial setup I added both the Tika CollectionReader and the Tika MarkupAnnotator to a CPE flow, assuming that's what's needed. Only after overcoming some confusion about the resulting CASes did I realize that they are intended to be used either/or. A word in the README may spare other people the confusion.
- MarkupAnnotator.xml states <outputsNewCASes>true</outputsNewCASes>. CVD will not show any results for annotators with that setting. In fact, the annotator runs just fine with that setting changed to false. From what I could see in the code, it just creates a new view, not a new CAS. But maybe I am missing something here.
- It returned reasonable results on the few HTML, MS Word and PPT files I tried. It silently refused to convert one PDF file (others worked). But I guess these are just limitations of the current PDF parser.
- The typesystem does have the information needed for further processing.
- As I understand it, Tika maps all document markup to the XHTML tagset. Since that is a closed set, it should be possible to use a more explicit typesystem modeling, where the known XHTML elements like title, body, p etc. are modeled as explicit subtypes instead of having only one generic type, MarkupAnnotation. Is that assumption correct? Which typesystem representation to use depends on use case (and taste :-) but finding and iterating over the different parts of the markup would be easier with explicit types.
- I think for document-level metadata attributes the situation is different, since that set is open (but there may be a core set as well).

So much for my first impressions. Good work.

- Thomas
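
P.S. To make the explicit-typing idea a bit more concrete, here is a rough sketch of what such a type system descriptor might look like. The package org.example.markup, the type names and the choice of three elements are placeholders of mine, not taken from the UIMA-1095 patch, and the actual MarkupAnnotation type may well carry features that are left out here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch only: all names below are illustrative placeholders. -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>ExplicitMarkupTypeSystem</name>
  <description>One subtype per known XHTML element.</description>
  <version>1.0</version>
  <vendor/>
  <types>
    <!-- Stand-in for the existing generic markup type. -->
    <typeDescription>
      <name>org.example.markup.MarkupAnnotation</name>
      <description>Generic markup element.</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
    <!-- Explicit subtypes for a few elements of the closed XHTML set. -->
    <typeDescription>
      <name>org.example.markup.Title</name>
      <description>XHTML title element.</description>
      <supertypeName>org.example.markup.MarkupAnnotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>org.example.markup.Body</name>
      <description>XHTML body element.</description>
      <supertypeName>org.example.markup.MarkupAnnotation</supertypeName>
    </typeDescription>
    <typeDescription>
      <name>org.example.markup.Paragraph</name>
      <description>XHTML p element.</description>
      <supertypeName>org.example.markup.MarkupAnnotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
```

With types along these lines, a downstream annotator could ask the CAS for the annotation index of Paragraph directly, while code that only knows the generic MarkupAnnotation type would still see all of them through the supertype.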
