Hi Jennifer,

Mapping terms to CUIs is its own problem, and there are a few nice tools already available that might be of some use. We've used MetaMap to good effect for this problem, so you might want to consider looking there.
https://metamap.nlm.nih.gov/

I'd be curious if other users have recommendations as well.

Good luck,
Ted

On Fri, Jun 2, 2017 at 7:56 PM, Jennifer Wilson jen.wilson...@gmail.com [umls-similarity] <umls-similarity@yahoogroups.com> wrote:

> Hi Ted,
>
> Thank you again for all of this. I'm sorry I had to put down this project
> for a few days and am only now getting back to it.
>
> I see that ontologies change and reproducing that result might not be the
> best sanity check on the scripts that I wrote.
>
> I'm going to try and figure out how to map to CUI terms and I'll be in
> touch if I get stuck again. Thanks,
>
> On Sun, May 28, 2017 at 10:59 AM, Ted Pedersen duluth...@gmail.com
> [umls-similarity] <umls-similarity@yahoogroups.com> wrote:
>
>> This is perhaps a bit more than you were looking for, but there are quite
>> a few command line tools available with UMLS::Similarity when you install
>> locally that can be helpful for digging into situations like this. When I
>> look for the path from each of these CUIs to the ROOT (of MSH) I find that
>> one of them does not have a path to the root, while the other does (see
>> command output below).
>>
>> The lack of a path to the root is going to cause a lot of measures to
>> report a -1 value (since path, for example, relies on finding this path as
>> a part of its computation). In fact, not having a path to the root makes me
>> question if C0156543 is in MSH at all, so it might even be that the CUI is
>> no longer a part of MSH (and not just lacking a path to the root). But,
>> regardless, clearly something has changed since 2009 that is causing this
>> measure to return a different value. This happens in some cases since UMLS
>> continues to evolve and CUIs are added, removed, etc. It's important to
>> know what version of the UMLS a previous study has used if you are
>> interested in getting a very exact comparison.
>> In the case of our AMIA 2009 paper we used 2008AB, so things have no
>> doubt changed a bit since then.
>>
>> tpederse@maraca:~$ findPathToRoot.pl C0156543
>>
>> UMLS-Interface Configuration Information:
>> (Default Information - no config file)
>>
>> Sources (SAB):
>> MSH
>> Relations (REL):
>> PAR
>> CHD
>>
>> Sources (SABDEF):
>> UMLS_ALL
>> Relations (RELDEF):
>> UMLS_ALL
>>
>> There are no paths from the given C0156543 to the root.
>>
>> tpederse@maraca:~$ findPathToRoot.pl C0000786
>>
>> UMLS-Interface Configuration Information:
>> (Default Information - no config file)
>>
>> Sources (SAB):
>> MSH
>> Relations (REL):
>> PAR
>> CHD
>>
>> Sources (SABDEF):
>> UMLS_ALL
>> Relations (RELDEF):
>> UMLS_ALL
>>
>> The paths between abortions, spontaneous (C0000786) and the root:
>> => C0000000 (**UMLS ROOT**) C1135584 (mesh headings) C1256739 (mesh
>> descriptors) C1256741 (topical descriptor) C0012674 (diseases (mesh
>> category)) C1720765 (female urogenital dis pregnancy compl) C0032962 (compl
>> pregn) C0000786 (abortions, spontaneous)
>>
>> On Sun, May 28, 2017 at 12:43 PM, Ted Pedersen <duluth...@gmail.com>
>> wrote:
>>
>>> Hi Jennifer,
>>>
>>> Thanks for sharing this question. I think in general if you have a
>>> choice between using CUIs or terms with UMLS::Similarity, your best option
>>> is to use the CUIs. Terms can map to multiple CUIs, and UMLS::Similarity
>>> might pick a CUI associated with a sense of the term you aren't intending.
>>> Also, if you misspell a term or don't specify it exactly correctly, then it
>>> shows up as not found. One useful resource for replicating similarity
>>> measure studies (like the one you cite) is the following page which
>>> includes term mappings for several of the datasets we've worked with over
>>> the years.
>>>
>>> http://www-users.cs.umn.edu/~bthomson/corpus/corpus.html
>>>
>>> I will admit to being a little puzzled about the case of abortion -
>>> miscarriage.
>>> The paper you cite clearly reports a value based on MSH, but
>>> as I try to run that query now I get a value of -1 (even when using the
>>> CUIs). However, it appears that each of the CUIs is found in MSH, but that
>>> somehow we are not able to compute some of the measures (a path length, for
>>> example). This suggests that there is not a path between the two CUIs,
>>> which has something to do with the structure of UMLS/MSH.
>>>
>>> One quick and dirty way to see if a CUI is in MSH is to find the path
>>> length between a CUI and itself. If it is present in MSH, that value will
>>> be 1. We see that for each of the CUIs used for abortion and miscarriage.
>>>
>>> tpederse@maraca:~$ perl query-umls-similarity-webinterface.pl --measure path --sab MSH C0156543 C0156543
>>> Default Settings:
>>> --default http://atlas.ahc.umn.edu/
>>> --rel PAR/CHD
>>> User Settings:
>>> --measure path
>>>
>>> 1<>Unspecified abortion NOS(C0156543)<>Unspecified abortion NOS(C0156543)
>>>
>>> tpederse@maraca:~$ perl query-umls-similarity-webinterface.pl --measure path --sab MSH C0000786 C0000786
>>> Default Settings:
>>> --default http://atlas.ahc.umn.edu/
>>> --rel PAR/CHD
>>> User Settings:
>>> --measure path
>>>
>>> 1<>Abortions.spontaneous(C0000786)<>Abortions.spontaneous(C0000786)
>>>
>>> However, when I try to find the path length between the two CUIs, I get
>>> -1. This suggests that the CUIs are not joined by PAR/CHD relations. Note
>>> that below you can see that the terms for the CUIs have been looked up,
>>> which shows us that MSH knows about them...
>>>
>>> tpederse@maraca:~$ perl query-umls-similarity-webinterface.pl --measure path --sab MSH C0156543 C0000786
>>> Default Settings:
>>> --default http://atlas.ahc.umn.edu/
>>> --rel PAR/CHD
>>> User Settings:
>>> --measure path
>>>
>>> -1<>Unspecified abortion NOS(C0156543)<>Abortions.spontaneous(C0000786)
>>>
>>> So, in any case, it would appear that something has changed in the
>>> structure of MSH since we reported our results in the 2009 AMIA paper you
>>> mention. I'm not sure what that is. But, I think the general message is
>>> that if you can use CUIs it will normally be more reliable to do that.
>>> Mapping terms to CUIs is of course its own problem, but UMLS::Similarity
>>> doesn't do anything terribly fancy with that, and so probably whatever you
>>> do will be more extensive and reliable than what UMLS::Similarity would
>>> do...
>>>
>>> I hope this helps somehow, and please do feel free to follow up.
>>> Thoughts from other users on this issue would also be most welcome!
>>>
>>> Cordially,
>>> Ted
>>>
>>> On Sat, May 27, 2017 at 12:18 PM, Jennifer Wilson
>>> jen.wilson...@gmail.com [umls-similarity]
>>> <umls-similarity@yahoogroups.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm resending this now that I'm subscribed. Any advice would be much
>>>> appreciated! Thank you,
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Jennifer Wilson <jen.wilson...@gmail.com>
>>>> Date: Tue, May 23, 2017 at 6:13 PM
>>>> Subject: Help with the best approach for using the query-UMLS interface
>>>> To: umls-similarity@yahoogroups.com
>>>>
>>>> Hello UMLS similarity team,
>>>>
>>>> I am trying to compute the similarity between ~30K disease/phenotype
>>>> terms. Ideally, I would have a matrix of similarity for these terms.
>>>>
>>>> My first attempt was to write a Python script to call the
>>>> query-umls-similarity-webinterface.pl script.
>>>> Though, before releasing the script on my dataset, I was trying to
>>>> recreate the scores from this paper
>>>> (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2815481/) in Table 1.
>>>>
>>>> Here's the command I am using:
>>>>
>>>> ./query-umls-similarity-webinterface.pl --sab MSH --rel PAR/CHD "Abortion" "Miscarriage"
>>>>
>>>> Default Settings:
>>>> --default http://atlas.ahc.umn.edu/
>>>> --measure path
>>>>
>>>> User Settings:
>>>> --rel PAR/CHD
>>>>
>>>> (-1.0, 'Abortion', 'Miscarriage')
>>>>
>>>> I also have not processed the text in my dataset much. I have basically
>>>> pulled diseases and phenotypes from DisGeNET, OMIM, PheWAS, and the GWAS
>>>> catalogue. If I'm using data from all of these sources, do you recommend
>>>> sending them directly to the query interface? Should I try and map to CUI
>>>> terms? (examples below)
>>>>
>>>> Before I download the database and attempt to query the database (it's
>>>> not a language that I use in my current work), I just wanted an outside
>>>> perspective to see if there are best practices for using this data. Thank
>>>> you in advance for your time!
>>>>
>>>> (examples)
>>>> Here are two more examples showing the disease descriptions in my
>>>> dataset. Is the UMLS interface robust to these various formats or do they
>>>> need to be an exact match?
>>>>
>>>> ./query-umls-similarity-webinterface.pl --sab MSH --rel PAR/CHD "Testicular Neoplasms" "Amelogenesis imperfecta local hypoplastic form"
>>>>
>>>> Default Settings:
>>>> --default http://atlas.ahc.umn.edu/
>>>> --measure path
>>>>
>>>> User Settings:
>>>> --rel PAR/CHD
>>>>
>>>> (-1.0, 'Testicular Neoplasms', 'Amelogenesis imperfecta local hypoplastic form')
>>>>
>>>> ./query-umls-similarity-webinterface.pl --sab MSH --rel PAR/CHD "Hypotrichosis 2, 146520 (3)" "PERIODONTITIS, LOCALIZED AGGRESSIVE"
>>>>
>>>> Default Settings:
>>>> --default http://atlas.ahc.umn.edu/
>>>> --measure path
>>>>
>>>> User Settings:
>>>> --rel PAR/CHD
>>>>
>>>> (-1.0, 'Hypotrichosis 2, 146520 (3)', 'PERIODONTITIS, LOCALIZED AGGRESSIVE')
>>>>
>>>> --
>>>> Jennifer L. Wilson
>>>> Bioengineering, Stanford University
>>>> jen.wilson...@gmail.com / 703.969.3318
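
For anyone scripting against the web interface from Python, as in the original question, here is a minimal sketch of how the result lines shown in this thread might be parsed. It is not part of UMLS::Similarity. The helper names (build_command, parse_result, pair_count) are hypothetical, the line pattern is inferred only from the three result lines quoted above, and combining --measure, --sab and --rel in one call is an assumption (each flag appears in the thread, but not all three together).

```python
import re

# Result lines quoted in this thread look like:
#   -1<>Unspecified abortion NOS(C0156543)<>Abortions.spontaneous(C0000786)
# This pattern is inferred from those examples only (an assumption).
RESULT_RE = re.compile(
    r"^(-?\d+(?:\.\d+)?)<>(.*?)\((C\d{7})\)<>(.*?)\((C\d{7})\)$"
)


def build_command(cui1, cui2, measure="path", sab="MSH", rel="PAR/CHD"):
    """Argument list for one query, using only flags that appear in the thread."""
    return [
        "perl", "query-umls-similarity-webinterface.pl",
        "--measure", measure, "--sab", sab, "--rel", rel,
        cui1, cui2,
    ]


def parse_result(line):
    """Return a dict for a result line, or None for any other output line."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        return None
    raw, term1, cui1, term2, cui2 = match.groups()
    value = float(raw)
    return {
        # A negative value (-1) is the "could not compute" marker in the thread.
        "score": None if value < 0 else value,
        "term1": term1, "cui1": cui1,
        "term2": term2, "cui2": cui2,
    }


def pair_count(n_terms):
    """Number of unordered pairs needed for a full similarity matrix."""
    return n_terms * (n_terms - 1) // 2
```

The list from build_command could be passed to subprocess.run(..., capture_output=True, text=True) and each line of stdout fed to parse_result. Note that pair_count(30000) is 449,985,000, so one web request per pair is unlikely to be practical for ~30K terms; a local install, as Ted suggests, is probably the more realistic route.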