Hi Lewis,
I'm not sure there is a single correct answer here. But in my experience,
certain clinical datasets with unusual formatting (even without the Tika step)
cause problems.
One problem is headers/footers/ASCII tables. The problem here is that they get
labeled "sentences," and the parser may try to "parse" them, which is
annoying but probably not a big deal for any downstream components -- cTAKES
typically isn't claiming to find any relations or clinical entities in these
sentences.
The other major problem is sentence detection. There is a known issue with
data in which line breaks ('\n' and '\r' characters) create sentence breaks
by rule, and in certain datasets that rule is not valid. This _is_ a problem
for downstream components, because the dictionary lookup and most relation
extractors work _within_ sentences, so an incorrect sentence break can lead
to a missed entity or relation.
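To make the failure mode concrete, here is a small Python sketch (not cTAKES
code, just an illustration of the rule) showing what a newline-as-sentence-
boundary rule does to a note whose lines wrap mid-sentence:

```python
def split_sentences_on_newlines(text):
    """Naive rule: every line break ends a sentence (the problematic behavior)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# A single clinical sentence that happens to wrap across two lines.
note = "Patient denies chest\npain or shortness of breath."

# The rule splits one sentence into two fragments, so a within-sentence
# dictionary lookup can no longer match the span "chest pain".
fragments = split_sentences_on_newlines(note)
print(fragments)  # ['Patient denies chest', 'pain or shortness of breath.']
```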
I am presenting work at AMIA this year on a new system, and some annotations I
created, for fixing that problem. We are going to test that system on a new
project for both speed and accuracy, and once we are satisfied with it I think
it will eventually become the default cTAKES sentence detector. With that said,
it may not totally solve the problem(s) you're dealing with. If the formatting
is such that page 1 of a scanned document has the start of a sentence, page 2
has the end of that sentence, and there is header and footer information
between them, we don't have a solution for you: cTAKES will probably not
segment those sentences correctly.
I think there are at least two new types of components/systems that it would be
nice to have someone look into (though I am not sure they would be "interesting
research problems" to any funding agencies):
1) Linguistic vs. non-linguistic information classifier -- segment a given
text file into the parts that should be processed linguistically and those
that should not. This could be a cTAKES/UIMA component.
2) Scanned-document preprocessor -- similar to the above, perhaps, but
purpose-built for recovering from the kinds of mistakes that occur in scanned
documents, such as headers/footers landing in the middle of the narrative, odd
punctuation characters, etc. It could be that by first finding non-linguistic
information and then carefully excising it you could resolve this problem, but
I don't work with enough of this data to have good intuitions about how hard
that would be.
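As a rough sketch of idea 2) -- hypothetical code, not an existing cTAKES
component -- a preprocessor could drop lines that repeat verbatim across pages
(a crude signature of headers and footers) and rejoin the narrative before
sentence detection:

```python
from collections import Counter

def strip_repeated_lines(pages, min_repeats=2):
    """Remove lines that appear on min_repeats or more pages (a crude
    header/footer heuristic), then rejoin the remaining narrative text."""
    counts = Counter(
        line.strip()
        for page in pages
        for line in page.splitlines()
        if line.strip()
    )
    kept = []
    for page in pages:
        for line in page.splitlines():
            s = line.strip()
            if s and counts[s] < min_repeats:
                kept.append(s)
    return " ".join(kept)

# A sentence split across two pages, with a repeated header and footer between.
pages = [
    "St. Elsewhere Hospital - Progress Note\nThe patient was started on\nConfidential",
    "St. Elsewhere Hospital - Progress Note\nwarfarin 5 mg daily.\nConfidential",
]
print(strip_repeated_lines(pages))
# -> 'The patient was started on warfarin 5 mg daily.'
```

This heuristic would obviously miss page-specific footers (page numbers, DOIs)
that differ on every page; those would need pattern-based rules instead.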
Hope that is helpful.
Tim
On 09/23/2015 02:07 PM, Lewis John Mcgibbney wrote:
Hi Folks,
I am looking for some feedback on the accuracy of cTAKES annotations when the
input text does not consist of properly formed paragraphs.
Is this known to significantly affect annotation accuracy/performance?
Does anyone have a 'golden' input example of where cTAKES works best for
annotation accuracy and performance?
My situation is as follows: right now I use Apache Tika to parse a multitude of
documents, and I feed the parse results from those documents into cTAKES for
annotation purposes. Sometimes Tika is not able to form paragraphs correctly
when a paragraph is split across a page break.
Another example is when footer information (such as page numbers, DOIs,
journal names, etc.) exists between pages.
Thanks for any feedback.
Lewis
--
Lewis