Mark & Michael & Others,
The current models were trained on old annotated news articles and are
really meant as useful examples. They were never meant to be complete,
in their training or otherwise. The copyright issues are complicated,
but in a nutshell the owners of the corpora that were used allow us to
use the generated data for educational and research purposes only, in
most cases. This means that commercial use is strictly forbidden by the
copyright holders, never mind the fact that you can't regenerate or
reproduce the original material from the models. I know it sounds like
an odd copyright, and some models may be a bit more lenient on the
details of the copyright.
The corpora were generated over the years by people doing research and
other tasks through CoNLL and other projects, to train models for POS
tagging, NER, and other types of pre-processing of textual data. Most of
these have continuing yearly or biyearly projects to do additional work
in these areas. OpenNLP isn't directly involved in these (to my
knowledge... I'm sure to get some bad press on this). The goal of those
projects is to get a set of training and test data for experimenting
with and researching different model approaches, to see if a best model
can be found for a given type of parsing, processing, or understanding
of the textual data.
With an Apache license, we have to be able to distribute the sources for
the models in order to comply with the license. As such, we have other
side projects set up to research and develop an easier method to
generate and tag the data for the various types of corpus data we need
to train against. The catch is that the data we gather needs to be FREE
of any copyright restrictions. We have found several avenues that seem
promising in this area:
https://cwiki.apache.org/confluence/display/OPENNLP/OpenNLP+Annotations
We also have sources in the sandbox for this and for other works in
progress for the OpenNLP project:
http://svn.apache.org/viewvc/opennlp/sandbox/ [via ViewVC]
https://svn.apache.org/repos/asf/opennlp/sandbox/ [via subversion]
By all means please get involved!
We need people who can read and annotate various languages. We need
people who can test models. We need people who can come up with new
ideas. We have other projects on the wiki for adding support for model
types other than just maxent. There is also another for using SORA as
the language.
Thanks for listening to me,
James Kosin