(Note: this is best considered a draft of material that might be included in the FAQ or other documentation.)
There are a few combinations of parameters in discriminate.pl that merit a comment or warning, since they won't necessarily make sense even though discriminate.pl allows you to use them.

If you use the --stat option to select a test of association for identifying bigram or co-occurrence features, you really should also use --stat_rank or --stat_score to select a subset of the bigrams or co-occurrences that are identified. If you do not, the effect is the same as if you hadn't used --stat at all, since all features will be selected regardless of their score on the test. --stat_rank N selects the top N features as ranked by their score, and --stat_score M selects the features with a score greater than M. Thus, --stat without --stat_rank or --stat_score reduces to a frequency-based selection of features.

Also note that if you use --remove N with --stat (but without --stat_score or --stat_rank), it has the same effect as simply removing features that occur fewer than N times, and again reduces to a frequency-based cutoff.

However, using --stat, --stat_rank, and --remove together is a reasonable thing to do. Suppose you have

    --stat ll --stat_rank 50 --remove 4

This means that you want to select the bigrams that occur at least 4 times and are ranked in the top 50 by the log-likelihood ratio.

The --scope_train R and --window S options are related: --scope_train specifies the size of the context from which features are selected, and --window specifies how far apart two words can be and still be considered a bigram or co-occurrence. A --window size of 2 means that the pair of words must be adjacent, a --window size of 3 means that one intermediate word may be present (and will be ignored), and so on. --scope_train 1 means that the "window of context" around the target word is 1 word to the left and 1 word to the right, or a 3 word window if you count the target word.
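To make the two sizes concrete, here is a minimal Python sketch. It is not taken from discriminate.pl (which is a Perl script); the function names and the toy sentence are made up for illustration. It assumes that the training context is simply the scope_train words on either side of the target word, and that a window of size S pairs each word with the S-1 words that follow it.

```python
def training_context(tokens, target_index, scope_train):
    """Return the words within scope_train positions of the target word."""
    start = max(0, target_index - scope_train)
    end = min(len(tokens), target_index + scope_train + 1)
    return tokens[start:end]


def window_pairs(context, window):
    """Pair each word with the words that follow it inside the window.

    A window of 2 yields adjacent pairs only; a window of 3 also allows
    one intervening word, and so on.
    """
    if window < 2:
        raise ValueError("window must be 2 or greater")
    pairs = []
    for i, first in enumerate(context):
        # The window counts both words of the pair, so look ahead window - 1.
        for j in range(i + 1, min(i + window, len(context))):
            pairs.append((first, context[j]))
    return pairs


tokens = "he went to the bank to deposit his check".split()
context = training_context(tokens, tokens.index("bank"), scope_train=1)
print(context)                    # ['the', 'bank', 'to']
print(window_pairs(context, 2))   # [('the', 'bank'), ('bank', 'to')]
print(window_pairs(context, 3))   # adds ('the', 'to')
print(window_pairs(context, 4) == window_pairs(context, 3))  # True
```

With a 3 word context, asking for a window of 4 yields exactly the same pairs as a window of 3, because there is no room for two intervening words.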
Thus, S (the --window size) should always be set to a value less than or equal to (2*scope_train) + 1. If --window is greater than this, it won't actually be able to find bigrams and co-occurrences with as many intervening words as such a window size would suggest. For example, if --window is 4 and --scope_train is 1, it will be impossible to find bigrams or co-occurrences with 2 intermediate words between them, since the scope of the training data is limited to a 3 word window (1 word to the left and right). Since --window 4 also considers window sizes of 3 and 2, the effect is the same as if you had specified a --window of 3. Also note that --window must always be 2 or greater, since the window size includes the two words that make up the bigram or co-occurrence.

Finally, --scope_test has no relation to --window or --scope_train. --scope_test is used during clustering to determine how the context of an instance to be clustered should be represented. If --scope_test is 5, the vectors of each of the words in a window 5 words to the left and right of the target word will be averaged to create a vector that represents that context.

--window and --scope_train only affect feature selection, so if you specify

    --scope_test 2 --scope_train 4 --window 3

this means that you will select features (bigrams or co-occurrences) from a window of context that stretches 4 words to the left and to the right of the target word. Within that context, the two words of a feature may be adjacent or have one intervening word between them. This creates the set of features that we will use to represent the contexts that we wish to cluster. Then, when it comes to clustering, we will build a vector based on the 5 words found in a window 2 words to the left and right of the target word, which includes the target word itself.

A few things to think about with --scope_test.
Remember that --scope_test specifies a fixed size around the target word, and that there is no guarantee that all the words in that scope will have a vector associated with them. So you might not have a vector for every word in the scope of the test instance. Also, keep in mind that the target word is represented by a vector too. This might be particularly useful if a target word has different morphological forms (buy or buying), or if we have candidate synonyms (fix and repair) as target words.

Now, in some cases it *might* make sense for discriminate.pl to advise or help users when they choose an ineffective combination of options, like using --stat without --stat_rank or --stat_score, or setting --window too high for --scope_train, as in --window > (2*scope_train) + 1. At this point though, our main objective is to let users know about these situations so they can be avoided.

Well, I hope this is useful, and that I haven't said anything wrong or unclear. Please let me know if there are any questions or comments on the above.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
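
Returning to the idea of advising users about ineffective option combinations, a rough sketch of what such a check could look like is given below in Python. This is not code from discriminate.pl; the function name check_options and its keyword arguments are hypothetical and simply mirror the option names discussed in this message.

```python
def check_options(stat=None, stat_rank=None, stat_score=None,
                  scope_train=None, window=None):
    """Return warnings for the ineffective combinations described above."""
    warnings = []
    # --stat alone selects every feature, whatever its score.
    if stat is not None and stat_rank is None and stat_score is None:
        warnings.append(
            "--stat without --stat_rank or --stat_score selects every "
            "feature, so selection reduces to a frequency cutoff"
        )
    # The window counts the two words of the pair itself.
    if window is not None and window < 2:
        warnings.append("--window must be 2 or greater")
    # The training context holds (2*scope_train) + 1 words at most.
    if window is not None and scope_train is not None:
        context_size = 2 * scope_train + 1
        if window > context_size:
            warnings.append(
                f"--window {window} exceeds the {context_size}-word training "
                f"context; it behaves like --window {context_size}"
            )
    return warnings


for message in check_options(stat="ll", scope_train=1, window=4):
    print("Warning:", message)
```

Called with stat="ll", scope_train=1 and window=4, the sketch reports both the missing --stat_rank/--stat_score and the oversized window; a combination such as --stat ll --stat_rank 50 --scope_train 4 --window 3 produces no warnings.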
