OpenNLP team: I'm trying to work out how to train a Parser model with OpenNLP. I see that I need to acquire a body of training data in OpenNLP format, which the docs suggest is basically Penn Treebank format, with one sentence per line. OK, this part is fine. The rub is, the "real" PTB data is hidden away by the gatekeeping / rent-seeking Linguistic Data Consortium and for my purposes is effectively unavailable.
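Side note, in case it helps frame the question: trees I pull from other corpora usually arrive pretty-printed across several lines, so I've been flattening them down to the one-tree-per-line shape the docs describe, and dumping the label inventory so I can compare it against published tag lists. A rough Python sketch of what I mean (the helper names are mine, not anything from OpenNLP):

```python
import re

def flatten_tree(tree_text: str) -> str:
    """Collapse a pretty-printed bracketed tree onto one line, which is
    the shape the OpenNLP docs show: one complete tree per line."""
    return re.sub(r"\s+", " ", tree_text).strip()

def node_labels(tree_line: str) -> set:
    """Collect every label that follows an open paren, so a candidate
    file's tag inventory can be compared against a reference list."""
    return set(re.findall(r"\((\S+)", tree_line))

# A tree wrapped across lines, as exported by many tools
pretty = """(TOP (S (NP-SBJ (PRP I) )
                 (VP (VBP say) (NP (CD 1992) ))
                 (. .) ))"""

line = flatten_tree(pretty)
print(line)
print(sorted(node_labels(line)))
# labels: ['.', 'CD', 'NP', 'NP-SBJ', 'PRP', 'S', 'TOP', 'VBP', 'VP']
```

No idea yet whether whitespace normalization alone is sufficient for the trainer, but it at least reproduces the surface shape of the documented examples.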
"Fine", I find myself thinking, I can get some data elsewhere, even if it means annotation my own from raw text. Or maybe I can borrow bits from something like the The Treebank Semantics Parsed Corpus. But here's where my question comes in: In the couple of examples of the training data format in the OpenNLP docs, we see stuff like this: (TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) )) (TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') )) In the manual, we are referred to < https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html> for info on the annotations, but by and l large what is shown on this page does not match the supplied example. For example, there is no mention of annotations "TOP", "S", "NP-SBJ", and so on. This leads me to question if I can locate useful pre-existing data, or transform data programmatically, or whatever, which I can then use for OpenNLP. Is there anything I can look at (besides digging in to the source for the Parser, which I will resort to if it comes to that) to help me understand exactly what the training data needs to look like? Maybe a slightly larger sample of a known good training data file? The thing is, I don't need much data, because my real goal is not a complete parser for generic English. I want something about half a step above "toy model" just so I can do experiments with the mapping from the syntactically parsed test, to my notions of a corresponding semantic model. Thanks for any and all help! Phil ~~~ This message optimized for indexing by NSA PRISM