[Senseclusters-users] thoughts on combinations of parameters in discriminate.pl (fwd)

ted pedersen Sun, 24 Oct 2004 07:52:02 -0700

(Note: this might be best considered as a draft of material to possibly be
included in the FAQ or other documentation)


There are a few combinations of parameters in discriminate.pl that merit
some comment/warning, in that they won't necessarily make sense (even
though discriminate.pl allows you to use them).

If you use the --stat option to select a test of association for
identifying bigram or co-occurrence features, you really should use
--stat_rank or --stat_score to select a subset of the bigrams or
co-occurrences that are identified. If you do not, the effect will be the
same as if you didn't use the --stat option at all, since all features
will be selected regardless of their --stat score. --stat_rank N allows
you to select the top N features as ranked by their --stat score, and
--stat_score M allows you to select features with a score greater than M.
Thus, --stat without --stat_rank or --stat_score reduces to a frequency
based selection of features.

Also note that if you use --remove N with --stat (but without
--stat_score or --stat_rank) it has the same effect as simply removing
features that occur less than N times, and again reduces to a frequency
based cutoff. However, using --stat, --stat_rank and --remove together is
a reasonable thing to do.

Suppose you have --stat ll --stat_rank 50 --remove 4. This would mean that
you want to select all the bigrams that occur at least 4 times and
are ranked in the top 50 by the log-likelihood ratio.
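
To make the interaction of these cutoffs concrete, here is a minimal
Python sketch. It is not how discriminate.pl is implemented; the function
name, the tuple layout and the sample scores are all made up for
illustration, and it assumes the frequency cutoff is applied before the
rank cutoff and that ties in score need no special handling.

```python
def select_features(scored, remove=None, stat_rank=None, stat_score=None):
    """scored: list of (feature, frequency, score) tuples (hypothetical layout)."""
    # Frequency cutoff: drop features seen fewer than `remove` times.
    kept = [f for f in scored if remove is None or f[1] >= remove]
    # Score cutoff: keep only features scoring above `stat_score`.
    if stat_score is not None:
        kept = [f for f in kept if f[2] > stat_score]
    # Rank cutoff: keep the `stat_rank` highest-scoring features.
    kept.sort(key=lambda f: f[2], reverse=True)
    if stat_rank is not None:
        kept = kept[:stat_rank]
    return [f[0] for f in kept]


# Invented sample data: (bigram, frequency, association score).
bigrams = [
    ("new york", 12, 85.2),
    ("of the", 40, 3.1),
    ("stock market", 5, 40.7),
    ("rare pair", 1, 60.0),
]

# Only a frequency cutoff: every bigram seen at least 4 times survives,
# whatever its score, which is the "reduces to frequency" case.
print(select_features(bigrams, remove=4))

# Frequency cutoff plus a rank cutoff: the score now matters.
print(select_features(bigrams, remove=4, stat_rank=2))
```

With neither stat_rank nor stat_score given, the scores play no part in
which features are returned, which is the situation described above.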

The --scope_train R and --window S options are related, in that
--scope_train specifies the size of the context from which features are
selected, and --window specifies how far apart two words can be and still
be considered bigrams or co-occurrences. Note that --window size of 2
means that the pair of words must be adjacent, and that --window size of 3
means that one intermediate word may be present (and will be ignored), and
so on. Now, --scope_train 1 means that the "window of context" around the
target word is 1 word to the left and right, or a 3 word window (if you
count the target word). Thus, S (the --window size) should always be set
to a value less than or equal to (2*R) + 1, where R is the --scope_train
value. Note that if --window is greater than this, it won't actually be
able to find bigrams and co-occurrences with as many intervening words as
such a window size would suggest. Also note that --window must always be 2
or greater (since the window size includes the two words that make up the
bigram / co-occurrence).
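
As a rough illustration of what the window size means, the Python sketch
below lists the ordered word pairs that fall inside a window of a given
size. The function and the sample sentence are hypothetical and are not
taken from discriminate.pl; the only thing borrowed from the description
above is that the window size counts both words of the pair.

```python
def pairs_in_window(tokens, window):
    """Ordered word pairs whose span, counting both words, is at most `window`."""
    if window < 2:
        raise ValueError("window must be 2 or greater")
    pairs = []
    for i, first in enumerate(tokens):
        # The second word may sit up to window - 1 positions to the right.
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs


tokens = ["the", "stock", "market", "fell"]

# Window of 2: only adjacent pairs.
print(pairs_in_window(tokens, 2))

# Window of 3: adjacent pairs plus pairs with one intervening word.
print(pairs_in_window(tokens, 3))
```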

For example, if --window is 4 and --scope_train is 1, it will be
impossible to find bigrams/co-occurrences with 2 intermediate words
between them, since the scope of the training data is limited to a 3 word
window (1 word to the left and right). Now, the --window 4 option will
also consider window sizes of 3 and 2 as well, so the effect is the same
as if you specified --window of 3.
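
The arithmetic behind this example can be written down in a few lines of
Python. This is only a sketch of the rule stated above, S <= (2*R) + 1;
the helper name is invented and nothing like it is claimed to exist in
discriminate.pl.

```python
def effective_window(window, scope_train):
    """Largest window that can actually be used given the training scope."""
    if window < 2:
        raise ValueError("window must be 2 or greater")
    # The context holds scope_train words on each side plus the target word.
    context_size = 2 * scope_train + 1
    return min(window, context_size)


# --window 4 with --scope_train 1 behaves like --window 3.
print(effective_window(4, 1))

# --window 3 with --scope_train 4 fits comfortably inside the 9 word context.
print(effective_window(3, 4))
```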

Finally, --scope_test has no relation to --window or --scope_train.
--scope_test is used during clustering to determine how the context of an
instance to be clustered should be represented. If --scope_test is 5, it
means that the vectors of each of the words in a window 5 words to the
left and right of the target word will be averaged to create a vector to
represent that context. --window and --scope_train only affect feature
selection, so if you specify

--scope_test 2 --scope_train 4 --window 3

This means that you will select features (bigrams or co-occurrences) from
a window of context that stretches 4 words to the left and to the right of
the target word. Within that context, we will allow those features to be
adjacent or have one intervening word between them. This will create the
set of features that we will use to represent the contexts that we wish
to cluster. Now, when it comes to clustering, we will build a vector based
on the 5 words found in a window 2 words to the left and right of the
target word, and that includes the target word.

A few things to think about with --scope_test. Remember that this
specifies a fixed size around the target word, and that there is no
guarantee that all the words in that scope will have a vector associated
with them. So you might not have a vector for every word in the scope of
the test instance. Also, keep in mind that the target word is represented
by a vector too. This might be particularly useful if a target word has
different morphological forms (buy or buying), or if we have candidate
synonyms (fix and repair) as target words.
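
The averaging step, including the case where some words in the scope have
no vector, might look roughly like the following Python sketch. The
function name, the dictionary of word vectors and the numbers in it are
all invented for illustration; the real word vectors come from the
selected features, and the actual code may differ in its details.

```python
def context_vector(tokens, target_index, scope_test, word_vectors):
    """Average the vectors of the words within scope_test of the target."""
    lo = max(0, target_index - scope_test)
    hi = min(len(tokens), target_index + scope_test + 1)
    # Words without a vector are simply skipped; the target word is included.
    found = [word_vectors[w] for w in tokens[lo:hi] if w in word_vectors]
    if not found:
        return None  # no word in scope has a vector
    dims = len(found[0])
    return [sum(v[d] for v in found) / len(found) for d in range(dims)]


tokens = ["i", "will", "fix", "the", "engine", "now"]
# Hypothetical 2-dimensional word vectors; most words have none.
word_vectors = {"fix": [4.0, 2.0], "engine": [0.0, 2.0], "now": [9.0, 9.0]}

# Target word "fix" at index 2, with 2 words of scope on each side.
# "now" has a vector but lies outside the scope, so it is not used.
print(context_vector(tokens, 2, 2, word_vectors))  # [2.0, 2.0]
```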

Now, in some cases it *might* make sense for discriminate.pl to advise /
help the user when they use an ineffective combination of options, like
using --stat without --stat_rank or --stat_score, or setting --window too
high for --scope_train, as in --window > (--scope_train*2) + 1. At this
point though, our main objective is to let users know about these
situations so they can be avoided.
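
If such advice were ever added, the checks themselves would be simple. The
Python sketch below is purely hypothetical (discriminate.pl does not do
this, and the function and message wording are invented); it only encodes
the rules discussed in this note.

```python
def check_options(stat=None, stat_rank=None, stat_score=None,
                  window=None, scope_train=None):
    """Return a list of warnings about ineffective option combinations."""
    warnings = []
    if stat is not None and stat_rank is None and stat_score is None:
        warnings.append(
            "--stat without --stat_rank or --stat_score selects all features")
    if window is not None and window < 2:
        warnings.append("--window must be 2 or greater")
    if (window is not None and scope_train is not None
            and window > 2 * scope_train + 1):
        warnings.append(
            "--window is larger than (2 * --scope_train) + 1 and will be "
            "limited by the training scope")
    return warnings


# Both problems at once: no rank/score cutoff, and a window too wide
# for the training scope.
for message in check_options(stat="ll", window=4, scope_train=1):
    print("warning:", message)
```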

Well, I hope this is useful, and that I haven't said anything wrong or
unclear. Please let me know if there are any questions or comments on
the above.

 --
Ted Pedersen
http://www.d.umn.edu/~tpederse


