Hello,

A couple of people have asked about generating their index and matrix files for the vector measure in the UMLS::Similarity package. I thought a small tutorial might help.
Note: I am assuming you have the UMLS::Interface and UMLS::Similarity packages already installed.

Instructions:

----------------------------------------------------------------------
Step I. Download and install Text::NSP:

    http://search.cpan.org/dist/Text-NSP/

----------------------------------------------------------------------
Step II. Use count.pl from Text::NSP to obtain the bigram counts of your desired corpus.

The basic command is:

    count.pl --ngram 2 <destination file> <corpus>

There are a couple of options you might want to consider when creating this:

1. --stoplist <stoplist file> : there is one in the package. If you need one, though, just email.
2. --frequency <number> : this is a lower bound cutoff; for example, only use those bigrams that occur more than two times
3. --ufrequency <number> : this is an upper bound cutoff; for example, only use those bigrams that occur less than 1,000 times
4. --newline : this keeps the program from collecting bigrams that cross the newline boundary

There are a number of additional options that you can use, but those seem to be the basic ones.

----------------------------------------------------------------------
Step III. Use the vector-input.pl program from UMLS::Similarity to create the index and matrix files.

The basic command is:

    vector-input.pl <index file> <matrix file> <bigram file>

----------------------------------------------------------------------
Using the files:

Once you have the index and matrix files created, you can use them in the umls-similarity.pl program with the following options:

    --vectorindex <index file>
    --vectormatrix <matrix file>

For example:

    umls-similarity.pl --measure vector --vectorindex <indexfile> --vectormatrix <matrixfile> hand skull

----------------------------------------------------------------------
Defaults:

The default vector index and matrix files are built using clinical records. The files are the same for both the command line and the web interface. You can find the files in the lib/UMLS directory of the package. They are called vector-index.dat and vector-matrix.dat.

I hope this helps. Please feel free to email if you have any questions!

Thanks,
Bridget
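
----------------------------------------------------------------------
Scripting the steps:

If you rebuild the files often, it may help to generate the three commands from a small script. The following is only a rough sketch in Python, not part of either package. The function names (count_cmd, vector_input_cmd, similarity_cmd, pipeline) and the file names bigrams.txt, my-index.dat and my-matrix.dat are made up for illustration; the program names and options are the ones listed in the steps above. The script only builds and prints the command lines; it does not run anything.

```python
import shlex


def count_cmd(bigram_file, corpus, stoplist=None, frequency=None,
              ufrequency=None, newline=False):
    """Build the Step II count.pl command as an argument list."""
    cmd = ["count.pl", "--ngram", "2"]
    if stoplist is not None:
        cmd += ["--stoplist", stoplist]
    if frequency is not None:
        cmd += ["--frequency", str(frequency)]    # lower bound cutoff
    if ufrequency is not None:
        cmd += ["--ufrequency", str(ufrequency)]  # upper bound cutoff
    if newline:
        cmd.append("--newline")
    # The destination (bigram) file comes before the corpus.
    return cmd + [bigram_file, corpus]


def vector_input_cmd(index_file, matrix_file, bigram_file):
    """Build the Step III vector-input.pl command."""
    return ["vector-input.pl", index_file, matrix_file, bigram_file]


def similarity_cmd(index_file, matrix_file, term1, term2):
    """Build the umls-similarity.pl command that uses the new files."""
    return ["umls-similarity.pl", "--measure", "vector",
            "--vectorindex", index_file,
            "--vectormatrix", matrix_file,
            term1, term2]


def pipeline(corpus, term1, term2, **count_options):
    """Return the three commands in order; the file names are hypothetical."""
    bigrams, index, matrix = "bigrams.txt", "my-index.dat", "my-matrix.dat"
    return [count_cmd(bigrams, corpus, **count_options),
            vector_input_cmd(index, matrix, bigrams),
            similarity_cmd(index, matrix, term1, term2)]


if __name__ == "__main__":
    # Print the commands so they can be checked before running them.
    for cmd in pipeline("corpus.txt", "hand", "skull",
                        frequency=2, newline=True):
        print(shlex.join(cmd))
```

To actually execute the commands, each list could be passed to subprocess.run(cmd, check=True), assuming count.pl, vector-input.pl and umls-similarity.pl are on your PATH.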
