Re: Next Steps for OpenNLP

James Kosin Tue, 01 Oct 2013 20:01:17 -0700

Mark & Michael & Others,

The current models were trained using old annotated news articles and are really meant as useful examples. They were never meant to be complete or to serve as a finished training set. The copyright issues are complicated, but in a nutshell the owners of the corpora that were used allow us to use the generated data for educational and research purposes only in most cases. This means that commercial use is strictly forbidden by the copyright holders, never mind the fact that you can't regenerate the original or reproduce the material from the models. I know it sounds like an odd copyright, and some models may be a bit more lenient on the details of the copyright.

The corpora were generated by people doing research and other tasks via CoNLL and other projects to train models to detect POS, NER, and other types of pre-processing of textual data over the years. Most of these have continual yearly or biyearly projects to do additional work in these areas. OpenNLP isn't directly involved in these (to my knowledge... I'm sure to get some bad press on this). But the goals of those projects are to get a set of training and test data to experiment and research on different model approaches, to see if a best model for the type of parsing/processing/understanding, etc. of the textual data can be found for the situation.

With an Apache license, we have to be able to distribute the sources for the models to be able to align with the license... as such, we have other side projects set up to research and develop an easier method to generate and tag the data for the various types of corpus data we need to train against. But the catch is that the data we gather needs to be FREE of any legal copyrights... we have found several avenues that seem promising in this area.

https://cwiki.apache.org/confluence/display/OPENNLP/OpenNLP+Annotations

We have sources in the sandbox for this and other works in progress for the OpenNLP project as well.

    http://svn.apache.org/viewvc/opennlp/sandbox/    [via ViewVC]
    https://svn.apache.org/repos/asf/opennlp/sandbox/    [via subversion]

By all means please get involved!

We need people who can read and annotate various languages. We need people who can test models. We need people who can come up with new ideas. We have other projects in the wiki for adding support for model types other than just maxent. There is also another for using SORA as the language.


Thanks for listening to me,
James Kosin