You can use the MITRE MIST tool for de-identification. It supports re-training, among other things. You have to run it as a pre-processor, independent of cTAKES, and then use its output as the input to cTAKES. http://mist-deid.sourceforge.net/
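To make the pre-processing arrangement concrete, here is a minimal Python sketch of the two-stage layout: raw notes go through a de-identifier, the scrubbed copies are written to a separate directory, and that directory is what you would then point cTAKES at. Everything named here is an assumption for illustration: `preprocess_notes` and `placeholder_deidentify` are hypothetical helpers, the one-`.txt`-file-per-note layout is assumed, and the regex is only a stand-in that marks where MIST's own output would be produced. It does not call MIST or cTAKES and is not adequate de-identification by itself.

```python
import re
import tempfile
from pathlib import Path
from typing import Callable, List


def placeholder_deidentify(text: str) -> str:
    """Hypothetical stand-in for the real de-identifier (e.g. MIST).

    It only masks phone numbers written as 617-555-0100, so it is
    NOT sufficient for real protected health information.
    """
    return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]", text)


def preprocess_notes(
    in_dir: Path, out_dir: Path, deidentify: Callable[[str], str]
) -> List[Path]:
    """Write a scrubbed copy of every .txt note in in_dir to out_dir.

    out_dir is the directory a downstream tool such as cTAKES would
    read from, so the raw notes never reach it.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.txt")):
        dst = out_dir / src.name
        scrubbed = deidentify(src.read_text(encoding="utf-8"))
        dst.write_text(scrubbed, encoding="utf-8")
        written.append(dst)
    return written


if __name__ == "__main__":
    # Small demonstration on a throwaway directory.
    work = Path(tempfile.mkdtemp())
    raw = work / "raw"
    raw.mkdir()
    (raw / "note1.txt").write_text("Call 617-555-0100 today.", encoding="utf-8")
    for path in preprocess_notes(raw, work / "deidentified", placeholder_deidentify):
        print(path.name, "->", path.read_text(encoding="utf-8"))
```

Keeping the scrubbed output in its own directory means the cTAKES step only ever sees text that has already passed through the de-identifier, and the de-identifier can be swapped or re-trained without touching the cTAKES configuration.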
Complete de-identification is an unsolved problem, though; there is no guarantee that nothing will leak. I hope this helps.

--
Guergana Savova, PhD, FACMI
Associate Professor
PI Natural Language Processing Lab
Boston Children's Hospital and Harvard Medical School
300 Longwood Avenue
Mailstop: BCH3092
Enders 144.1
Boston, MA 02115
Tel: (617) 919-2972
Fax: (617) 730-0817
[email protected]
Harvard Scholar: http://scholar.harvard.edu/guergana_k_savova/biocv
ctakes.apache.org
thyme.healthnlp.org
cancer.healthnlp.org
share.healthnlp.org

From: Dipankar Ray [mailto:[email protected]]
Sent: Friday, January 13, 2017 6:01 PM
To: [email protected]
Subject: de-identification

Hi folks,

Apologies if this is a newbie question - tried to look for an earlier occurrence of it, but was unsuccessful.

From this website (https://open.med.harvard.edu/project/scrubber/) I learned that the Scrubber de-identification tool is now available as part of CTAKES. But I didn't see anything about de-identification listed among the components here: https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+3.2+Component+Use+Guide

Question: How do I use CTAKES for de-identification?

best,
Dipankar
