OpenNLP team:

I'm trying to work out how to train a Parser model with OpenNLP. I see that
I need to acquire a body of training data in OpenNLP format, which the docs
suggest is basically Penn Treebank format, with one sentence per line. OK,
this part is fine. The rub is, the "real" PTB data is hidden away by the
gatekeeping / rent-seeking Linguistic Data Consortium and for my purposes
is effectively unavailable.

"Fine", I find myself thinking, I can get some data elsewhere, even if it
means annotating my own from raw text. Or maybe I can borrow bits from
something like the Treebank Semantics Parsed Corpus. But here's where
my question comes in:

In the couple of examples of the training data format in the OpenNLP docs,
we see stuff like this:

(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))
(TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))
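For what it's worth, I wrote a little sanity-check script (my own sketch, not from the OpenNLP docs) that parses one bracketed tree per line into nested lists, just to convince myself the format is plain s-expressions with a label after each open paren:

```python
# Sketch: parse a PTB-style bracketed tree (one per line) into nested
# Python lists, as a well-formedness check before training.
import re

def parse_tree(line):
    """Parse a bracketed tree like '(TOP (S ...))' into nested lists."""
    # tokens are '(', ')', or any run of non-space, non-paren characters
    tokens = re.findall(r"\(|\)|[^\s()]+", line)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        label = tokens[pos]          # constituent or POS label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])  # a terminal (word)
                pos += 1
        pos += 1                     # consume ')'
        return [label] + children

    tree = parse()
    assert pos == len(tokens), "trailing tokens after tree"
    return tree

line = "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))"
print(parse_tree(line)[0])  # prints the root label: TOP
```

Both of the sample lines above round-trip through this cleanly, which at least tells me the brackets balance and every node has a label.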

In the manual, we are referred to <
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html>
for info on the annotations, but by and large what is shown on that page
does not match the supplied example. For example, there is no mention of
annotations "TOP", "S", "NP-SBJ", and so on. This leads me to question
whether I can locate useful pre-existing data, or transform data
programmatically, or whatever, which I can then use for OpenNLP.
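In that spirit, here's another little sketch I put together (again my own, with a made-up sample in place of a real data file) to inventory the labels actually used in a one-tree-per-line file, so I could compare converted data from some other treebank against whatever the OpenNLP examples use:

```python
# Sketch: count every node label (the token right after each '(') in a
# set of bracketed trees, to compare label inventories between corpora.
import re
from collections import Counter

def label_inventory(lines):
    """Return a Counter of constituent/POS labels across bracketed trees."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\(([^\s()]+)", line))
    return counts

# the two sample trees from the OpenNLP docs
sample = [
    "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))",
    "(TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))",
]
print(sorted(label_inventory(sample)))
```

Running that on the two doc examples turns up labels like NP-SBJ and TOP that the linked POS page never mentions, which is exactly what prompted this question.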

Is there anything I can look at (besides digging into the source for the
Parser, which I will resort to if it comes to that) to help me understand
exactly what the training data needs to look like? Maybe a slightly larger
sample of a known-good training data file?

The thing is, I don't need much data, because my real goal is not a
complete parser for generic English. I want something about half a step
above "toy model", just so I can do experiments with the mapping from the
syntactically parsed text to my notions of a corresponding semantic model.


Thanks for any and all help!


Phil
~~~
This message optimized for indexing by NSA PRISM
