Hi Jens,

Thank you very much for your thoughtful input. At the time of writing my question, I was not aware of how easy it is (using uimaFIT functionality) to query for Annotations within other Annotations, so it seemed to make sense to draw hard boundaries between sections. I also wanted to tokenize and perform word counts only on certain sections of the document, without the HTML markup, which required me to clean within sections as well.
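To make that concrete, the pattern I ended up with looks roughly like the sketch below. Note that `Section`, `Token`, the `getSpeaker` feature, and the view name are placeholders standing in for my actual type system, so treat the names as illustrative rather than exact:

```java
import java.util.List;

import org.apache.uima.cas.CASException;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

// Illustrative sketch only: Section and Token are placeholders for the
// annotation types in my own type system, not standard UIMA types.
public class SectionWordCounts {

    // Create a separate view holding the cleaned (markup-free) text.
    public JCas createCleanedView(JCas jcas, String cleanedText) throws CASException {
        JCas cleaned = jcas.createView("cleanedText"); // view name is arbitrary
        cleaned.setDocumentText(cleanedText);
        return cleaned;
    }

    // Count tokens per section by selecting only the Tokens covered
    // by each Section annotation in the cleaned view.
    public void countTokensPerSection(JCas cleanedView) {
        for (Section section : JCasUtil.select(cleanedView, Section.class)) {
            List<Token> tokens =
                JCasUtil.selectCovered(cleanedView, Token.class, section);
            System.out.println(section.getSpeaker() + ": " + tokens.size() + " tokens");
        }
    }
}
```

The `JCasUtil.selectCovered` call is what made the hard section boundaries unnecessary: it returns only the annotations whose spans fall inside the covering annotation, so everything can live in one view.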
Ultimately, however, I decided to do something much like you suggest, except that I created a separate view to hold the extracted and "cleaned" text and the Annotations of different sections, tokens, words, etc. My CAS consumer has no difficulty extracting the sections that I need (I have to loop to get every section for a particular speaker, but that's not much of an issue) and working with Annotations within those sections. In other words, your suggested approach works very well, and I appreciate you sharing it with me. Your project looks very interesting; I will keep a lookout for updates.

Regards,
Matt

On Mon, Aug 31, 2015 at 7:30 AM, Jens Grivolla <[email protected]> wrote:
> Hi Matt,
>
> As Richard said, Views are designed more for having "parallel"
> information, such as separate layers of audio, transcript, video, etc.
> referring to the same content or "document".
>
> I'm not quite sure why you want to split your document for processing
> (which you could do with a CAS Multiplier). Wouldn't it be much easier to
> just maintain and process it as one document, marking the different
> segments with e.g. speaker information? I don't quite understand your
> need for splitting; your AEs can run on all the segments (and most can be
> instructed not to cross segment boundaries, or only work at the sentence
> level anyway).
>
> Of course, if what you want is to be able to search for and retrieve
> segments that pertain to different speakers, then you will need to index
> your content in something like Solr outside of UIMA. While you could
> use a CAS Multiplier and then index each generated CAS as a document, it is
> much easier to just have a CasConsumer that knows how to deal with your
> segment annotations and extracts the information you want to index in an
> appropriate form.
>
> You may want to look at our project EUMSSI (http://eumssi.eu/), which is
> about doing exactly this.
> You can find our initial design here:
> http://www.aclweb.org/anthology/W14-5212, which we presented at the last
> UIMA workshop (http://glicom.upf.edu/OIAF4HLT/), and some more
> documentation on https://github.com/EUMSSI/EUMSSI-platform/wiki.
>
> The segment indexing is not in there yet, but I expect to put something on
> GitHub in the next one or two weeks.
>
> Best,
> Jens
>
> On Wed, Aug 26, 2015 at 4:45 PM, Matthew DeAngelis <[email protected]>
> wrote:
>
> > Hello UIMA Gurus,
> >
> > I am relatively new to UIMA, so please excuse the general nature of my
> > question and any butchering of the terminology.
> >
> > I am attempting to write an application to process transcripts of audio
> > files. Each "raw" transcript is in its own HTML file, with a section
> > listing biographical information for the speakers on the call, followed
> > by a number of sections containing transcriptions of the discussion of
> > different topics. I would like to be able to analyze each speaker's
> > contributions separately by topic and then aggregate and compare these
> > analyses between speakers and between each speaker and the full text. I
> > was thinking that I would break the document into a new segment each time
> > the speaker or the section of the document changes (attaching relevant
> > speaker metadata to each segment), run additional Analysis Engines on
> > each segment (tokenizer, etc.), and then arbitrarily recombine the
> > results of the analysis by speaker, etc.
> >
> > Looking through the documentation, I am considering two approaches:
> >
> > 1. Using a CAS Multiplier. Under this approach, I would follow the
> > example in Chapter 7 of the documentation, divide on section and speaker
> > demarcations, add metadata to each CAS, run additional AEs on the CASes,
> > and then use a multiplier to recombine the many CASes for each document
> > (one for the whole transcript, one for each section, one for each
> > speaker, etc.).
> > The advantage of this approach is that it seems easy to incorporate
> > into a pipeline of AEs, since they are designed to run on each CAS. The
> > disadvantage is that it seems unwieldy to have to keep track of all of
> > the related CASes per document and aggregate statistics across the CASes.
> >
> > 2. Using CAS Views. This option is appealing because it seems like CAS
> > Views were designed for associating many different aspects of the same
> > document with one another. However, it looks to me as though I would have
> > to specify different views both when parsing the document into sections
> > and when passing them through subsequent AEs, which would make it harder
> > to drop into an existing pipeline. I may be misunderstanding how
> > subsequent AEs work with Views, however.
> >
> > For those more experienced with UIMA, how would you approach this
> > problem? It's entirely possible that I am missing a third (fourth,
> > fifth...) approach that would work better than either of those above, so
> > any guidance would be much appreciated.
> >
> > Regards and thanks,
> > Matt
> >
