RE: How to add a new dictionary database to cTAKES

Finan, Sean Fri, 28 Feb 2014 07:01:31 -0800

Hi Abhishek,

You have some interesting timing ...
I can give you the xml specifications that you require if you send me the 
format of your dictionary.


Since you are new to the current dictionary module setup, I might also have a 
simpler solution for you ...

A couple of days ago I checked a new module into Sandbox called 
ctakes-dictionary-lookup2 (how novel a name).  It is a complete replacement of 
the current dictionary lookup module, but both can sit side-by-side in your 
local trunk sandbox or build.  It has an example descriptor that tells it to 
read a bar-separated value file (BSV) as a dictionary, storing it (indexed) in 
memory for fast lookup.  There is an example dictionary and xml descriptor for 
that dictionary.  It accepts 2 or 3 column files in the format CUI|Text or 
CUI|TUI|Text.  It automatically detects the number of columns, but they must be 
in that order.  The text fields do not need to be pre-tokenized: it accepts 
"Tumor, malignant" as well as "tumor , malignant" because it performs the 
tokenization itself when reading the file.  
Because the dictionary is held in memory, it should not be huge.  If you do 
have a very large number of terms (>50k) then I recommend an hsql db.  The new 
module will take an hsql db with the fixed field names CUI, TUI, RINDEX, 
TCOUNT, TEXT, RWORD.  I will explain what those mean in some documentation that 
I plan to check into sandbox later today, but I can help you build an hsql 
dictionary db ...
Yesterday I checked into sandbox a project named "dictionarytool".  It is 
source-only, but I can give you a jar if you want one.  Out-of-the-box it will 
build various dictionaries from a UMLS download.  It can build BSV, Hsql (new 
format) and Hsql (current format) to be used by the new or current dictionary 
lookup modules.
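
As a rough illustration of the BSV format described above, here is a minimal 
Python sketch of reading CUI|Text or CUI|TUI|Text lines, detecting the column 
count, and tokenizing the text field.  This is not the code in 
ctakes-dictionary-lookup2 (which is Java); the function names, the sample CUI 
and TUI values, and the tokenization rule (lowercase, punctuation split into 
separate tokens) are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split punctuation into separate tokens.

    Assumed rule for illustration only; the real module's tokenizer
    may behave differently.
    """
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def parse_bsv_line(line):
    """Parse one 'CUI|Text' or 'CUI|TUI|Text' line.

    Returns (cui, tui, tokens); tui is None for two-column lines.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if len(fields) == 2:
        cui, text = fields
        tui = None
    elif len(fields) == 3:
        cui, tui, text = fields
    else:
        raise ValueError("expected 2 or 3 bar-separated columns: %r" % line)
    return cui, tui, tokenize(text)


def build_index(lines):
    """Build an in-memory index from token tuples to (cui, tui) pairs."""
    index = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cui, tui, tokens = parse_bsv_line(line)
        index.setdefault(tuple(tokens), []).append((cui, tui))
    return index


# Hypothetical sample rows; the CUI and TUI values are placeholders.
sample = [
    "C0000001|T191|Tumor, malignant",
    "C0000001|T191|tumor , malignant",
]
index = build_index(sample)
```

Both sample rows end up under the same key, ("tumor", ",", "malignant"), 
which is the point of tokenizing at load time: the file does not have to be 
pre-tokenized.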

This devlist announcement is a little premature on my part.  I will not get 
usage documentation into sandbox for a day or two, but I can send you copies as 
I go if you are in a hurry, or just give you xml snippets for the current 
module descriptors.  If you send the format of your dictionary then that can be 
done quickly.  I just wanted to let you know that there is another option wrt 
dictionary lookup.

Sean

-----Original Message-----
From: Abhishek De [mailto:abhishek...@alumnux.com] 
Sent: Friday, February 28, 2014 6:58 AM
To: dev@ctakes.apache.org
Subject: How to add a new dictionary database to cTAKES

 

Hi, 

How do I add a new database to the cTAKES pipeline to perform lookup from? How 
do I specify what columns to look up and how to annotate the text with the 
returned hits? I have gone through the DictionaryLookupAnnotatorDB.xml and 
LookupDesc_Db.xml files. However, I could not understand the meanings of the 
terms like "lookupField", "metaField", "maxPermutationLevel" and 
"exclusionTags". If I add a new database, I need to configure this xml file 
properly. Please guide me regarding these problems. 

Thanks and Regards, 

Abhishek De
