Dear George:

As explained in one of the previous messages, you must distinguish the
prior probability of a sentence, P(sentence), from the probability of a
certain parse tree given a sentence, P(parse-tree|sentence).

Given that each tree generates only one sentence,
P(sentence|parse-tree-i)=1. If a sentence can be generated by only one
parse tree, then P(parse-tree|sentence)=1. 

If a sentence can be generated by two parse trees, the probability of
each tree given that sentence is, according to Bayes' theorem,

P(parse-tree-1|sentence) = P(sentence|parse-tree-1) x P(parse-tree-1) / 
   [P(sentence|parse-tree-1) x P(parse-tree-1) 
        + P(sentence|parse-tree-2) x P(parse-tree-2)] 
   = P(parse-tree-1) / [P(parse-tree-1) + P(parse-tree-2)]
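
As a quick illustration (a small Python sketch; the two priors are
made-up numbers, only there to show the arithmetic):

    # Hypothetical prior probabilities of the two trees that generate
    # the same sentence.
    p_tree1 = 0.03
    p_tree2 = 0.01

    # Since P(sentence|parse-tree-i) = 1, Bayes' theorem reduces to a
    # simple normalization of the priors.
    p_tree1_given_sentence = p_tree1 / (p_tree1 + p_tree2)   # 0.75
    p_tree2_given_sentence = p_tree2 / (p_tree1 + p_tree2)   # 0.25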

> The question relates to the conditional probability assigned to a rule,
> based on the training data, and it can be shown by the following very
> simple example: Given that my grammar contains many rules of the
> form "NOUN -> X", where X various nouns, among which the noun "play",
> the rule "VERB -> play", as well as many other rules. Assume also that
> the training set contains 1 instance of the verb "play" and many more
> (say 10) instances of the noun "play". The probability attached to the
> rule "VERB -> play" in a "generative" manner is 1.0, 

This is true only if there is no other rule of the type "VERB -> some-verb".

> while the one
> attached to the rule "NOUN -> play" is less than 1.0. When it
> comes to parsing a sentence containing this token, if there is no
> other way of selecting between the two rules (i.e., using the rest of
> the grammar), we would say that the word is a verb, despite the fact
> that we have seen many more noun instances. Doesn't this seem odd?

Again you must distinguish between P("play"|NOUN) and P(NOUN|"play"). In
this example, in which the context does not tell you whether the
word "play" is a noun or a verb,

P(NOUN|"play") = P("play"|NOUN) x P(NOUN) /
     [P("play"|NOUN) x P(NOUN) + P("play"|VERB) x P(VERB)] .

According to your data set, P(NOUN) is more than ten times larger than
P(VERB), and for this reason P(NOUN|"play") > P(VERB|"play"), despite
the fact that P("play"|NOUN) < P("play"|VERB).
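
To make the arithmetic concrete, here is a small Python sketch. The
counts of "play" as noun and verb come from your example; the total
numbers of noun and verb tokens in the training set are assumptions I
am inventing just so the numbers work out:

    # Counts of "play" taken from your example.
    count_play_as_noun = 10
    count_play_as_verb = 1

    # Hypothetical totals: 100 noun tokens and 1 verb token in training.
    total_noun_tokens = 100
    total_verb_tokens = 1
    total_tokens = total_noun_tokens + total_verb_tokens

    p_play_given_noun = count_play_as_noun / total_noun_tokens   # 0.1
    p_play_given_verb = count_play_as_verb / total_verb_tokens   # 1.0
    p_noun = total_noun_tokens / total_tokens                    # ~0.99
    p_verb = total_verb_tokens / total_tokens                    # ~0.01

    # Bayes' theorem: the large prior P(NOUN) outweighs the fact that
    # P("play"|VERB) > P("play"|NOUN).
    num_noun = p_play_given_noun * p_noun
    num_verb = p_play_given_verb * p_verb
    p_noun_given_play = num_noun / (num_noun + num_verb)         # ~0.91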

> Would it be better to condition on the appearance of the body, rather
> than the head of the rule?

I don't know exactly what you mean. Each of the probabilities you have
in your grammar is the probability that a non-terminal (the left-hand
side, which I assume is what you call the head) is replaced by a certain
combination of terminals and non-terminals (the right-hand side, the
body). That is, you have the probability of the body given the head.
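
If it helps, this is a minimal sketch of how such probabilities can be
estimated from counts; the rule list is invented for illustration:

    from collections import Counter

    # Each rule is (head, body), extracted from parsed training data.
    # These particular rules and their repetitions are hypothetical.
    rules = [
        ("NOUN", ("play",)), ("NOUN", ("play",)), ("NOUN", ("game",)),
        ("VERB", ("play",)),
        ("S", ("NP", "VP")),
    ]

    head_counts = Counter(head for head, _ in rules)
    rule_counts = Counter(rules)

    # P(body|head): relative frequency of an expansion among all
    # expansions of the same head (left-hand side).
    p_body_given_head = {
        (head, body): count / head_counts[head]
        for (head, body), count in rule_counts.items()
    }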

> A slightly different and more complex, though still artificial, example
> is the following:
> 
> Given a set of training sentences:
> Fat people eat accumulates.
> Fat people eat often.
> Animal fat is harmful.
> Eat!
> Leave!
> 
> and the induced grammar:
> S -> NP VB often      freq=1   p=0.16
> S -> NP VB harmful    freq=1   p=0.16
> S -> NP VB            freq=1   p=0.16
> S -> NP S VB          freq=1   p=0.16
> S -> VB               freq=2   p=0.33

The sentence "Fat people eat accumulates" can be generated by two
different parse trees. For ambiguous grammars, the probability of a tree
is not the same as the probability of a sentence. Apparently you are
interested in the probability of trees, but you are computing the
probability of sentences. If you want to obtain the probability of a rule,
you should use a set of trees, not a set of sentences.
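
To make the distinction explicit: in a probabilistic grammar, the
probability of a tree is the product of the probabilities of the rules
it uses, and the probability of a sentence is the sum over all trees
that generate it. A small Python sketch with hypothetical numbers:

    # Rule probabilities used by the two parses of the same sentence;
    # the values are invented for illustration.
    tree1_rule_probs = [0.16, 0.5, 0.2]
    tree2_rule_probs = [0.33, 0.1]

    def tree_probability(rule_probs):
        """Probability of a tree: product of its rule probabilities."""
        p = 1.0
        for r in rule_probs:
            p *= r
        return p

    p_tree1 = tree_probability(tree1_rule_probs)
    p_tree2 = tree_probability(tree2_rule_probs)

    # Probability of the sentence: sum over the trees that yield it.
    p_sentence = p_tree1 + p_tree2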

You must also reconsider the assumption that the sentences in your
dataset have been randomly generated from a grammar by selecting at each
step one of the available replacements for a non-terminal.

In my opinion, you must have a careful definition of (the meaning of)
your probabilities, and then build a coherent model that leads to those
probabilities. Otherwise you will encounter lots of paradoxes and
inconsistencies.

Regards
  Javier Díez

--------------------------------------------------------------------
F. J. Diez                       Phone: +34-91-3987161
Dpto. Inteligencia Artificial    Fax:   +34-91-3986697
UNED. Senda del Rey, 9           E-mail: [EMAIL PROTECTED]
28040 Madrid. Spain              WWW: http://www.ia.uned.es/~fjdiez

