> 
> Hi,
> 
> I find this project very interesting. But I'd like to check how you define 
> contexts.
> 
> Am I right in thinking that the context "features" which are used to construct 
> context vectors and then similarity matrices are bigram word pairs only, 
> albeit bigrams defined flexibly (by the NSP package) to range over the space 
> of a sampling window?
> 
> I think I do something very similar, but I found for my purposes it has been 
> necessary to have both the preceding and following context in a single 
> calculation to get good results (perhaps because I relate both single word 
> and multiple word "tokens").
> 
> -Rob Freeman
> 

Thanks for your comments, Rob. I will try to answer, and then I'm sure
Amruta can give us a definitive answer on what we do and don't do.

We define context to be pretty much whatever is in between the <context>
and </context> tags in your input instance. So this could include a single
sentence, as in:

Instance 1:

<context>
That <head>dog</head> ate his food like an animal.
</context>

or quite a bit more...

Instance 2:

<context>
I remember my brother from quite long ago.
His <head>dog</head> was a big black lab. 
The dog loved Sam quite a lot, and Sam loved him too.
</context>

Right now there are two different schemes of clustering, represented by
the driver or helper scripts order1.sh and order2.sh. In order1.sh, we
support Ngrams in general, that is unigrams, bigrams, trigrams, etc. In
order2.sh we are restricted (for now) to bigrams. However, do note that
the Ngrams in general do not need to be consecutive; there may be
intervening words that are ignored. So you can use the bigram features
as "pure" bigrams, or more like co-occurrence features.
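To illustrate the idea of non-consecutive bigrams, here is a minimal
sketch (not SenseClusters or NSP code; the function name and whitespace
tokenization are my own for illustration) of collecting ordered word
pairs whose positions fall within a window, so that intervening words
are allowed but ignored:

```python
# Illustrative sketch only: gather ordered word pairs within a window.
# With window=2 you get "pure" (consecutive) bigrams; with a larger
# window, intervening words are skipped, giving co-occurrence features.
def windowed_bigrams(tokens, window=2):
    """Return ordered pairs (tokens[i], tokens[j]) with i < j < i + window."""
    pairs = []
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs

tokens = "the big dog ate his food".split()
print(windowed_bigrams(tokens, window=2))  # consecutive bigrams only
print(windowed_bigrams(tokens, window=3))  # also includes e.g. ("the", "dog")
```

With window=2 the pairs are exactly the consecutive bigrams; widening
the window is what turns the same bigram machinery into a more general
co-occurrence feature.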

The distinction between order1 and order2 explains to some extent why
things are set up like this. Order1 looks for the features (unigram,  
bigram, trigram, etc.) in the context, and builds a feature vector that  
represents whether or not each of the desired features occurs.

So if we had trigram features that were identified by NSP as being somehow 
interesting...

remember my brother : F1
Sam loved him : F2
ate his food : F3

Then the two instances above would be represented as 

             F1  F2  F3
Instance1     0   0   1
Instance2     1   1   0

Then a similarity matrix can be constructed, or we could simply
cluster these vectors directly. The important point is that the features
(Fx) can be unigrams, bigrams, trigrams, etc., and with appropriate
tokenization (which both NSP and SenseClusters support) those can include
part of speech tags or other information. 
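The order1 representation above can be sketched in a few lines. This is
just an illustration of the binary feature vectors, not SenseClusters
code; the substring matching and the feature list are my own
simplifications:

```python
# Illustrative sketch of order1-style binary feature vectors.
# Each context is checked for each chosen feature (here, trigrams),
# yielding a 0/1 vector per instance.
features = ["remember my brother", "Sam loved him", "ate his food"]  # F1..F3

def order1_vector(context, features):
    """1 if the feature string occurs in the context, else 0."""
    return [1 if f in context else 0 for f in features]

instance1 = "That dog ate his food like an animal."
instance2 = ("I remember my brother from quite long ago. "
             "His dog was a big black lab. "
             "The dog loved Sam quite a lot, and Sam loved him too.")

print(order1_vector(instance1, features))  # [0, 0, 1]
print(order1_vector(instance2, features))  # [1, 1, 0]
```

These two vectors match the Instance1/Instance2 rows in the table above,
and could then be clustered directly or turned into a similarity matrix.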

order2 style clustering only supports bigrams right now. However, it uses  
the bigrams in a much different way. (If you are familiar with Schutze,  
1998, what we are doing is quite similar to that.)

First, a word matrix is created, where each word in a corpus is
represented as both a row and a column entry. Each cell of the word
matrix contains the count of the number of times the row word and the
column word occur together in the corpus (and that is where the bigrams
come into the picture). In effect, each row-column pair in the word
matrix represents a bigram. 
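As a rough sketch of that construction (again, not the SenseClusters
implementation; I assume whitespace tokenization and consecutive
bigrams here for simplicity):

```python
from collections import Counter

# Illustrative sketch: build a word-by-word matrix where cell (r, c)
# counts how often the bigram "r c" occurs in the corpus.
def word_matrix(sentences):
    counts = Counter()
    vocab = set()
    for sent in sentences:
        tokens = sent.split()
        vocab.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):  # consecutive bigrams
            counts[(w1, w2)] += 1
    words = sorted(vocab)
    matrix = [[counts[(r, c)] for c in words] for r in words]
    return words, matrix

words, m = word_matrix(["the dog ate", "the dog ran"])
print(m[words.index("the")][words.index("dog")])  # 2: "the dog" occurs twice
```

Each row of this matrix is the "word vector" used in the next step.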

Then, once the word matrix is constructed, each context is represented
as the average of the vectors of all the words that occur in that
context, where each word vector comes from that matrix. Ultimately
each instance is represented by a single vector that is the average of
all the word vectors that make up the context.
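The averaging step can be sketched like this (illustrative only; the
toy words and matrix values below are made up, and words not in the
matrix are simply skipped, which is one possible policy):

```python
# Illustrative sketch of order2-style context representation:
# average the word-matrix row vectors of the words in the context.
def context_vector(context_tokens, words, matrix):
    rows = [matrix[words.index(w)] for w in context_tokens if w in words]
    if not rows:  # no known words: fall back to a zero vector
        return [0.0] * len(matrix[0])
    return [sum(col) / len(rows) for col in zip(*rows)]

words = ["cat", "dog", "ran"]          # toy vocabulary
matrix = [[0, 1, 2],                   # toy co-occurrence counts
          [1, 0, 3],
          [2, 3, 0]]
print(context_vector(["dog", "ran"], words, matrix))  # [1.5, 1.5, 1.5]
```

The resulting single vector per instance is what then gets clustered.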

The important point with respect to feature types supported in order2
is that we are working to include ngram support. This will mean that
the word matrix can be made up of Ngrams (how often do these two Ngrams
occur together in the same context).

As an example of what this might allow (that is not currently supported),
you could have bigrams such as "George Bush" and "Tony Blair"; then,
if a context included both of them, this would be represented in the
word matrix (which we would need to rename as an Ngram matrix, I guess).
George Bush and Tony Blair would then each have a vector associated
with them that would be used to represent them when they occur
in a context that is being clustered.

We have not yet incorporated the Ngram support for order2 clustering,
but it is our next immediate priority, so if there are any questions or
requests in that regard, now is a good time to raise them.

I hope this makes a bit of sense, and I am sure Amruta will be able to
clarify anything that I might have misstated. 

Thanks again, the feedback is much appreciated!
Ted

--
# Ted Pedersen                              http://www.umn.edu/~tpederse #
# Department of Computer Science                        [EMAIL PROTECTED] #
# University of Minnesota, Duluth                                        #
# Duluth, MN 55812                                        (218) 726-8770 #



