Hello,

A couple of people have asked about generating their own index and matrix
files for the vector measure in the UMLS::Similarity package, so I thought a
small tutorial might help.

Note: I am assuming you already have the UMLS::Interface and
UMLS::Similarity packages installed.

Instructions:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Step I. Download and install Text::NSP:
http://search.cpan.org/dist/Text-NSP/

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step II. Use count.pl from Text::NSP to obtain the bigram counts for your
corpus.

The basic command is: count.pl --ngram 2 <destination file> <corpus>

There are a few options you might want to consider when running it:

1. --stop <stoplist file> : a list of stop words to exclude from the counts.
There is a stoplist in the package; if you need one, though, just email me.

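2. --frequency <number> : this is a lower bound cutoff; bigrams that occur
fewer than <number> times are left out. For example, --frequency 3 keeps
only those bigrams that occur more than two times.

3. --ufrequency <number> : this is an upper bound cutoff; bigrams that occur
more than <number> times are left out. For example, --ufrequency 999 keeps
only those bigrams that occur less than 1,000 times.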

4. --newline : this keeps the program from collecting bigrams that cross
the newline boundary

count.pl has a number of additional options, but these are the basic ones.
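
In case it helps, here is a minimal sketch of how the options above might be
combined in one call. The file names (corpus.txt, bigrams.txt) and the cutoff
values are placeholders I made up for illustration, and the count_bigrams
helper with its DRY_RUN switch is hypothetical, not part of Text::NSP. Pick
cutoffs that suit your own corpus.

```shell
#!/bin/sh
# Sketch only: file names and cutoff values are placeholders, not
# recommendations. count_bigrams and DRY_RUN are hypothetical helpers.

CORPUS="${CORPUS:-corpus.txt}"      # plain-text input corpus
BIGRAMS="${BIGRAMS:-bigrams.txt}"   # destination file for the bigram counts
MIN_FREQ="${MIN_FREQ:-3}"           # value passed to --frequency
MAX_FREQ="${MAX_FREQ:-999}"         # value passed to --ufrequency

count_bigrams() {
    # Assemble the count.pl command from the settings above.
    set -- count.pl --ngram 2 \
        --frequency "$MIN_FREQ" --ufrequency "$MAX_FREQ" \
        --newline "$BIGRAMS" "$CORPUS"

    # With DRY_RUN=1 only print the command; otherwise run it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 before calling count_bigrams prints the command instead of
running it, which is a cheap way to check the options before starting a long
run on a large corpus.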

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Step III. Use the vector-input.pl program from UMLS::Similarity to create
the index and matrix files.

The basic command is:

vector-input.pl <index file> <matrix file> <bigram file>

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Using the files:

Once you have created the index and matrix files, you can use them in the
umls-similarity.pl program with the following options:

--vectorindex <index file>
--vectormatrix <matrix file>

For example:

umls-similarity.pl --measure vector --vectorindex <index file> \
    --vectormatrix <matrix file> hand skull
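
To tie the steps together, here is a rough sketch of the whole workflow as a
small shell script. It only strings together the three commands shown above.
The file names (corpus.txt, bigrams.txt, my-index.dat, my-matrix.dat) and the
run and build_and_query helpers are hypothetical names I picked for the
example, so adjust them to your own setup.

```shell
#!/bin/sh
# Sketch of the full workflow; every file name below is a placeholder.

CORPUS="${CORPUS:-corpus.txt}"      # plain-text input corpus
BIGRAMS="${BIGRAMS:-bigrams.txt}"   # bigram counts written by count.pl
INDEX="${INDEX:-my-index.dat}"      # index file written by vector-input.pl
MATRIX="${MATRIX:-my-matrix.dat}"   # matrix file written by vector-input.pl

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical helper: build the files, then compare two terms.
# Each step runs only if the previous one succeeded.
build_and_query() {
    # Step II: bigram counts from the corpus.
    run count.pl --ngram 2 "$BIGRAMS" "$CORPUS" &&
    # Step III: index and matrix files from the bigram counts.
    run vector-input.pl "$INDEX" "$MATRIX" "$BIGRAMS" &&
    # Use the new files with the vector measure.
    run umls-similarity.pl --measure vector \
        --vectorindex "$INDEX" --vectormatrix "$MATRIX" "$1" "$2"
}
```

After sourcing the script, calling build_and_query hand skull runs the three
steps in order. With DRY_RUN=1 set it just prints the three commands, so you
can check them first.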

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Defaults:

The default vector index and matrix files are built using clinical records.
The files are the same for both the command line and the web interface.
You can find the files in the lib/UMLS directory of the package. They are
called vector-index.dat and vector-matrix.dat.

I hope this helps. Please feel free to email if you have any questions!

Thanks,

Bridget
