Hi Lewis,
I'm not sure there is a single correct answer here, but in my experience 
certain clinical datasets with unusual formatting (even without the Tika step) 
cause problems.

One problem is headers/footers/ASCII tables. These get labeled as 
"sentences," and the parser may try to "parse" them, which is annoying but 
probably not a big deal for any downstream components -- cTAKES typically 
isn't claiming to find any relations or clinical entities in those 
"sentences."

The other major problem is sentence detection. There is a known problem with 
data in which line breaks ('\n' and '\r' characters) create sentence breaks 
by rule, and in certain datasets that is not a valid rule. This _is_ a problem 
for downstream components because the dictionary lookup and most relation 
extractors work _within_ sentences, so an incorrect sentence break can lead to 
a missed entity or relation discovery.
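To make the failure mode concrete, here is a toy sketch in Python (not the 
actual cTAKES implementation): the first function applies the problematic 
"every line break is a sentence break" rule, and the second is one hypothetical 
mitigation that rejoins a hard-wrapped line with the next one whenever it does 
not end in sentence-final punctuation.

```python
import re

def naive_sentences(text):
    # The problematic rule: every '\n' or '\r' ends a sentence.
    return [s.strip() for s in re.split(r'[\r\n]+', text) if s.strip()]

def rejoin_wrapped(text):
    # Hypothetical mitigation: drop a line break (replace with a space)
    # unless the preceding character is sentence-final punctuation.
    return re.sub(r'(?<![.!?:])[\r\n]+', ' ', text)

note = "Patient denies chest\npain and dyspnea.\nVitals stable."

# The naive rule splits "chest pain" across two "sentences", so a
# dictionary lookup working within sentences would miss the entity;
# rejoining wrapped lines first keeps the mention intact.
```

This is only a sketch of the idea; real clinical notes also contain lists and 
tables where a bare line break _is_ a legitimate boundary, which is exactly why 
a learned sentence detector beats a fixed rule here.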

I am presenting work at AMIA this year on a new system and some annotations I 
created to address that problem. We are going to test the system on a new 
project for both speed and accuracy, and once we are satisfied with it I think 
it will eventually become the default cTAKES sentence detector. That said, it 
may not totally solve the problem(s) you're dealing with. If there is some kind 
of formatting where page 1 of a scanned document has the start of a sentence, 
then page 2 has the end of that sentence, but there's header and footer 
information between, we don't have a solution for you. Probably cTAKES will not 
segment those sentences correctly.

I think there are at least two new types of components/systems that would be 
nice to have someone look into (though I am not sure they would be 
"interesting research problems" to any funding agencies):

1) Linguistic vs. non-linguistic information classifier -- segment a given 
text file into the parts that should be processed linguistically and those 
that should not. This could be a cTAKES/UIMA component.

2) Scanned document preprocessor -- similar to the above perhaps, but 
purpose-built for recovering from the kinds of mistakes that occur in scanned 
documents: headers/footers landing in the middle of the narrative, odd 
punctuation characters, etc. It could be that by first finding non-linguistic 
information and then carefully excising it you could solve this problem, but I 
don't work with enough of this data to have good intuitions about how hard it 
would be.
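Purely as a hypothetical sketch of ideas 1/2 (nothing like this exists in 
cTAKES today, and the patterns and the 50% alphabetic-density threshold are 
made-up heuristics, not tuned values), a line-level filter might look like:

```python
import re

def looks_non_linguistic(line):
    # Made-up heuristics: page-number footers, ASCII-table-style rows,
    # and lines with low alphabetic density are flagged as non-linguistic.
    s = line.strip()
    if not s:
        return True
    if re.fullmatch(r'(Page\s+\d+(\s+of\s+\d+)?|-?\s*\d+\s*-?)', s, re.I):
        return True                      # page-number header/footer
    if s.count('|') >= 2 or re.search(r'\s{3,}\S+\s{3,}', s):
        return True                      # ASCII-table-style columns
    alpha = sum(c.isalpha() for c in s)
    return alpha / len(s) < 0.5          # mostly digits/punctuation

def excise_non_linguistic(text):
    # Drop flagged lines before sentence detection runs.
    return '\n'.join(l for l in text.splitlines()
                     if not looks_non_linguistic(l))
```

A real component would need to be much more careful (dosage lines and lab 
values are digit-heavy but clinically meaningful), but it shows the shape of 
the "find non-linguistic information, then excise it" pipeline.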

Hope that is helpful.

Tim




On 09/23/2015 02:07 PM, Lewis John Mcgibbney wrote:
Hi Folks,

I am looking for some feedback on the accuracy of cTAKES annotations over 
input text when the input text is not formed into proper paragraphs.
Is this known to significantly affect annotation accuracy/performance?
Does anyone have a 'golden' input example of where cTAKES works best for 
annotation accuracy and performance?

My situation is as follows: right now I use Apache Tika to parse a multitude 
of documents and I feed the parse results from those documents into cTAKES for 
annotation purposes. Sometimes Tika is not able to form paragraphs correctly 
because the paragraphs are split across pages.

Another example is when footer information (such as page numbers, DOIs, 
journal names, etc.) exists between pages.

Thanks for any feedback.
Lewis

--
Lewis
