Hi Martin,

I'm not sure this is exactly what you are looking for, but I use a Lucene index as a corpus database. Because Lucene provides powerful Analyzers, we can use them to normalize text (A -> a, ß -> ss, 廣 -> 広, ...). Once the text is normalized, the normalized words are recorded in the Lucene index. And since Lucene also provides an API for accessing the index (the word database), we can get basic statistics such as word counts, N-gram counts, etc.
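For illustration, here is a minimal sketch of both halves (assuming Lucene 5.x on the classpath; the field name "body", the sample text, and the index path are just placeholders): a custom Analyzer that lowercases and ASCII-folds tokens, followed by a lookup of basic term statistics through the IndexReader API.

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class CorpusStatsDemo {

  public static void main(String[] args) throws Exception {
    // An Analyzer that lowercases (A -> a) and folds to ASCII (ß -> ss).
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream sink = new ASCIIFoldingFilter(new LowerCaseFilter(source));
        return new TokenStreamComponents(source, sink);
      }
    };

    // 1) Normalization: run raw text through the analyzer chain.
    try (TokenStream ts =
             analyzer.tokenStream("body", "Ein GROSSES Beispiel: Straße")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term);   // ein / grosses / beispiel / strasse
      }
      ts.end();
    }

    // 2) Stats: ask an existing index (path is illustrative) for term counts.
    try (IndexReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      Term t = new Term("body", "strasse");
      System.out.println("docFreq       = " + reader.docFreq(t));       // #docs containing the term
      System.out.println("totalTermFreq = " + reader.totalTermFreq(t)); // total occurrences
    }
  }
}

If you want word N-gram counts rather than single-word counts, the same pattern works with a ShingleFilter added to the analyzer chain, so that the indexed terms are the N-grams themselves.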
I'm now working on the NLP4L (NLP for Lucene) project [1]. It currently has connectors for Mahout and Spark, but I think it could have one for OpenNLP, too. If you have any ideas on how our project could work with OpenNLP, please let us know.

Thanks,

Koji

[1] https://github.com/NLP4L/nlp4l

On 2015/05/03 21:57, Martin Wunderlich wrote:
Hi all,

OpenNLP provides lots of great features for pre-processing and tagging. However, one thing I am missing is a component that works at the higher level of corpus management and document handling. Imagine, for instance, that you have raw text which is sent through different pre-processing pipelines. It should be possible to store the results in some intermediate format for future processing, along with the configuration of the pre-processing pipelines. Up until now, I have been writing my own code for this for prototyping purposes, but surely others have faced the same problem and there are useful solutions out there. I have looked into UIMA, but it has a relatively steep learning curve.

What are other people using for corpus management?

Thanks a lot.

Cheers,

Martin
