Re: [umls-similarity] Practical large coverage configuration

Ted Pedersen duluth...@gmail.com [umls-similarity] Tue, 08 Jul 2014 14:58:25 -0700

Hi Chaitanya,

Regarding coverage, I think that's something you need to be selective
about. Some sources are intended for specific domains, and so you'll want
to make sure the sources you choose all "fit" your intended domain.
Different sources can be organized differently and may reflect different
levels of granularity, and so your results (with similarity measurements,
for example) can actually degrade somewhat as you add sources.

In general, while we've tried mixing various sources, I'm not sure I've
run across a case (yet) where both the coverage and the performance of the
similarity measures increased significantly. I think what that tells us is
that within a certain domain, a source that focuses on that domain
typically has pretty good coverage, and adding other sources might increase
coverage while resulting in a "muddier" view of the world. I guess my
summarizing comment is that increasing coverage with multiple sources
sometimes seems to have a pretty negative effect on similarity measures.

That said, it's a fascinating question (what are the benefits of mixing
sources) and certainly everything I say above is pretty anecdotal and meant
somewhat casually. But, I think the key is to add sources when you are
pretty sure you need the additional coverage for a particular domain.
Generically increasing coverage may not help performance of similarity
measures much, at least not as far as I've seen.

However, if anyone has more specific experience or comments, I'd be very
interested to hear about that.

Good luck,
Ted

On Tue, Jul 8, 2014 at 3:39 PM, Bridget McInnes btmcin...@gmail.com
[umls-similarity] <umls-similarity@yahoogroups.com> wrote:

>
>
> Hello Chaitanya,
>
> Given the size and number of links between each of the nodes with the
> configuration file that you are using, I would suggest using the
> --realtime option rather than building the index. When I run experiments
> using the entire UMLS this is usually what I do because of space and time
> issues. The --realtime option will calculate the path information between
> the concepts on the fly rather than running a DFS through the taxonomy and
> pre-storing path information in an index.
>
> For the paper: Pedersen, T., Pakhomov, S. V. S., Patwardhan, S., & Chute,
> C. G. (2007). Measures of semantic similarity and relatedness in the
> biomedical domain. These experiments were done on SNOMEDCT prior to its
> inclusion in the UMLS and prior to the creation of the UMLS-Similarity
> package. To replicate those experiments in a subsequent paper (
> http://www-users.cs.umn.edu/~bthomson/publications/btmcinnes-amia2009.pdf),
> we used:
>
> SAB :: include SNOMEDCT
> REL :: include PAR, CHD
>
> I hope this helps!
>
> Let us know if you have any additional questions or something isn't clear!
>
> Best regards,
>
> Bridget
>
>
> On Tue, Jul 8, 2014 at 2:30 PM, chaitanyapshiv...@yahoo.co.in
> [umls-similarity] <umls-similarity@yahoogroups.com> wrote:
>
>>
>>
>> Hi
>>
>>
>> I had some questions related to indexing and configurations.
>>
>>
>> I have tried running UMLS::Similarity with the following configuration:
>>
>>
>> SAB :: include MSH, RXNORM, ICD9CM, NCI, SNOMEDCT_US
>> REL :: include PAR, CHD
>>
>> I was running the indexing on a fairly powerful machine (16-core CPU
>> with 64G RAM). I let the indexing run for a week and it occupied more
>> than 500G but was still running. This is understandable considering the
>> number of sources I have added is large and that the graph size would grow
>> exponentially.
>>
>> From previous threads I understand SNOMEDCT takes a day. I can definitely
>> afford running it more if I can add more sources.
>>
>> I wish to have more coverage of concepts and hence wish to add more
>> sources. What is the best compromise to achieve more sources within a
>> reasonable amount of time?
>>
>> Also what is the exact configuration used for the paper
>> Pedersen, T., Pakhomov, S. V. S., Patwardhan, S., & Chute, C. G. (2007).
>> Measures of semantic similarity and relatedness in the biomedical domain.
>>
>> Your input would be very helpful.
>>
>> Chaitanya.
>>
>>
>>
>  
>

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
