Happy Friday Everyone,
 
Hope Friday afternoon doesn't turn out to be a terrible time to post a 
question. I've been doing a little text mining of free-text patient medical records 
as of late. I started out trying to predict whether or not cancer patients had 
received KRAS mutation testing and did quite well with that. Now I'm trying to 
predict the results of KRAS testing (mutated vs. wild type). This is proving to 
be a little more difficult.
 
With the first classification task, I created counts of terms (e.g., "kras", 
"mutated") in the text medical records using the tm package and then used those 
counts to predict whether or not patients had had KRAS mutation testing. I 
tried a few different analyses here, but found that random forests worked the 
best.
 
Predicting the results of testing is harder though because of the way 
physicians and other healthcare professionals write about testing. For example, 
I'm finding phrases like "KRAS mutation returned wild-type". In this example, 
if we're counting, we get one instance of "kras", one instance of "mutation", and 
one instance of "wild". So you can see how it might be difficult to accurately 
predict the results of testing based on counts alone.
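
To make the problem concrete, here is a minimal sketch in base R (not the tm 
pipeline I actually used). The two notes are made up for illustration and are 
not from real records; they mean opposite things but produce identical counts 
for the terms of interest.

```r
# Two hypothetical notes with opposite meanings (invented for illustration)
notes <- c(
  "KRAS mutation testing returned wild-type",
  "KRAS mutation found, not wild-type"
)

# Lower-case, then split on anything that is not a letter
tokens <- strsplit(tolower(notes), "[^a-z]+")

# Count each term of interest in each note
terms <- c("kras", "mutation", "wild")
counts <- t(sapply(tokens, function(tok) {
  sapply(terms, function(trm) sum(tok == trm))
}))

print(counts)

# Both rows are the same, so a count-based classifier cannot separate them
identical(counts[1, ], counts[2, ])
```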
 
My question is how best to deal with this. Are there any R text mining packages 
or related software that would be particularly suited to my problem? I took a 
look at the CRAN Task View: Natural Language Processing and there were so many 
options I didn't really know where to start (and it's not even clear that an 
R-based solution will work best for my problem). Alternatively, is there any 
real chance one could simply write code that would be able to identify true 
references to the results of KRAS testing and then create counts only of what 
are likely to be true references?
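
As a rough idea of what I mean by "simply write code", here is a sketch using 
only base R regular expressions. The function name classify_kras, the phrase 
lists, and the example sentences are all my own assumptions, not a validated 
rule set, and it assumes each input is a single sentence. It checks wild-type 
and negative wording first, so a sentence like "mutated, not wild-type" would 
still be misclassified.

```r
# Hypothetical rule-based classifier for one sentence of clinical text.
# The patterns below are illustrative guesses, not a validated vocabulary.
classify_kras <- function(sentence) {
  s <- tolower(sentence)

  # Ignore sentences that never mention KRAS
  if (!grepl("kras", s, fixed = TRUE)) return(NA_character_)

  wild   <- "wild[- ]?type|no (kras )?mutation|negative for|not mutated"
  mutant <- paste0("mutation (was )?(detected|identified|found|present)",
                   "|positive for|\\bmutated\\b|\\bmutant\\b|g1[23][a-z]")

  # Wild-type / negative wording takes priority over mutation wording
  if (grepl(wild, s, perl = TRUE)) return("wild-type")
  if (grepl(mutant, s, perl = TRUE)) return("mutated")
  "unknown"
}

# Made-up example sentences
sentences <- c(
  "KRAS mutation returned wild-type",
  "KRAS mutation detected (G12D)",
  "No KRAS mutation identified",
  "KRAS testing has not been ordered",
  "Patient seen for follow-up"
)

results <- vapply(sentences, classify_kras, character(1), USE.NAMES = FALSE)
print(data.frame(sentence = sentences, result = results))
```

The resulting labels could then be counted per patient in place of raw term 
counts, but I don't know how well rules like these would hold up on real notes.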
 
I'd greatly appreciate it if someone could point me in the right direction.
 
Thanks,
 
Paul 
 
 

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.