I ran several documents through cTAKES, using AggregatePlaintextUMLSProcessor, and examined the list of org.apache.ctakes.assertion.medfacts.types.Concept annotations produced for each. From those results, I compiled a list of phrases that I had hoped cTAKES would annotate but that it did not. I looked up each of those phrases in MetaMap, and found that approximately 150 of them produced a full-phrase match and a corresponding CUI.
I used the MetamorphoSys scripts to load the UMLS RRF data set into a SQL DB, and queried the DB to confirm that those ~150 phrases were indeed present with the expected CUIs. So, the question becomes, why didn’t cTAKES annotate them?

Looking at the cTAKES logs, it appears the OrangeBookFilter “Filtered out” only 5 out of the 150. The other possible cause I could think of was the TUI filtering; there was no evidence of it in the logs, but I don’t know whether the results of filtering in that step get logged by default or not. I looked up in the DB the TUIs for each of the phrases, compared them to the lists of “allowed” TUIs in LookupDesc_Db.xml, and concluded that the TUI filtering might account for 44 of the phrases. So the rest remain a mystery.

I modified the TUI lists in LookupDesc_Db.xml to add TUIs, in the hopes that that would cause the corresponding phrases to be annotated. Specifically, I added T058 to one list, and added a second list with a handful of TUIs:

    <property key="procedureTuis" value="T058,T059,T060,T061"/>
    <property key="chemicalanddrugTuis" value="T109,T110,T116,T121,T123"/>

T058 corresponded to 3 of the phrases on my list; T121 alone accounted for 24 of them. But, upon restarting cTAKES with that modified file, and running relevant documents, I found that the expected phrases were still not annotated. I even tried making the same change in LookupDesc.xml just in case, to no avail.

So, the questions are:

- Are there reasons beyond the OrangeBook and TUI filters why CUI-associated phrases in UMLS would not get annotated?
- Do TUI-filter results get logged by default, and if not, is there a way (log4j settings?) to log them without making code changes?
- Am I doing the TUI filter changes wrong?

Thanks for any answers and advice.
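P.S. For anyone who wants to repeat the TUI comparison described above, here is a rough sketch of the check in Python. It is only an illustration under several assumptions: the `<properties>` wrapper element, the function names, and the phrase-to-TUI entries are all hypothetical placeholders (the real phrase TUIs came from my UMLS DB query), and the rule "a property whose key ends in `Tuis` is an allowed list" is my reading of the file, not something I have confirmed in the cTAKES code.

```python
import xml.etree.ElementTree as ET

# The two property lines quoted in the post, wrapped in a hypothetical
# <properties> root element so that the fragment parses on its own.
FRAGMENT = """
<properties>
  <property key="procedureTuis" value="T058,T059,T060,T061"/>
  <property key="chemicalanddrugTuis" value="T109,T110,T116,T121,T123"/>
</properties>
"""


def allowed_tuis(xml_text):
    """Collect every TUI listed in a <property> whose key ends in 'Tuis'."""
    root = ET.fromstring(xml_text)
    tuis = set()
    for prop in root.iter("property"):
        if prop.get("key", "").endswith("Tuis"):
            values = prop.get("value", "").split(",")
            tuis.update(v.strip() for v in values if v.strip())
    return tuis


def blocked_phrases(phrase_tuis, allowed):
    """Return the phrases for which none of the TUIs is in the allowed set."""
    return sorted(
        phrase for phrase, tuis in phrase_tuis.items()
        if not set(tuis) & allowed
    )


# Placeholder data only: in practice this mapping would be filled from the
# UMLS DB query results, one entry per unannotated phrase.
PHRASE_TUIS = {
    "phrase A": ["T121"],
    "phrase B": ["T058"],
    "phrase C": ["T033"],  # T033 does not appear in the fragment above
}

if __name__ == "__main__":
    allowed = allowed_tuis(FRAGMENT)
    print("Phrases with no allowed TUI:", blocked_phrases(PHRASE_TUIS, allowed))
```

A phrase that shows up in the output is a candidate for having been dropped by the TUI filter; a phrase that does not show up must have been missed for some other reason, which is the case I am trying to understand.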
