Cool! Yeah, Tika has one also. Now for the annoying use case: older web sites and pre-web text in Southeast Asia and India/Pakistan are written in phonetic US-ASCII. (That was the only technology they had available.) Does anybody do classification on that kind of text?

On Tue, Apr 24, 2012 at 7:17 AM, Jason Baldridge <[email protected]> wrote:
> Naive Bayes, perceptron variants (incl. passive-aggressive), faster training
> for maxent, and a better overall architecture. These are things my students
> and I are working on independently, and I will bring in to OpenNLP when
> time frees up to do so.
>
> On Tue, Apr 24, 2012 at 2:26 AM, Jörn Kottmann <[email protected]> wrote:
>
>> What are you planning to add?
>>
>> Jörn
>>
>>
>> On 04/24/2012 03:53 AM, Jason Baldridge wrote:
>>
>>> FWIW, there will be more classification capabilities coming in the next
>>> several months.
>>>
>>> -Jason
>>>
>>> On Mon, Apr 23, 2012 at 5:12 PM, Jörn Kottmann <[email protected]>
>>> wrote:
>>>
>>>> OpenNLP is using either a Maxent or Perceptron classifier
>>>> to classify a piece of text. This can give you back the probabilities
>>>> for the various categories, but it's not designed to tell you how
>>>> much each topic is represented in your input document.
>>>>
>>>> You could take a document and assume each paragraph has one topic
>>>> and then classify it paragraph by paragraph.
>>>> We sadly don't have support for topic models, such as LDA.
>>>>
>>>> All the training logs are still written to the console, we have plans
>>>> to properly capture them and report training progress back via an
>>>> API. This output should then be logged and maybe just stored inside
>>>> the model for later debugging.
>>>>
>>>> Jörn
>>>>
>>>>
>>>> On 04/23/2012 07:41 PM, Alex Kudlick wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've just started using OpenNLP for a project to classify scientific
>>>>> articles into subjects. I have a few questions:
>>>>>
>>>>> 1. How do I configure logging for the model? I'm using slf4j-log4j for
>>>>> the rest of my application, but the training output from the model
>>>>> just goes to stdout.
>>>>>
>>>>> 2. Is there any support for classifying documents with multiple classes?
>>>>> For instance, a given article may be classified as Computational
>>>>> Biology, Cell Biology, and Molecular Biology.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex Kudlick
>
> --
> Jason Baldridge
> Associate Professor, Department of Linguistics
> The University of Texas at Austin
> http://www.jasonbaldridge.com
> http://twitter.com/jasonbaldridge

--
Lance Norskog
[email protected]
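
For anyone who wants to try the paragraph-by-paragraph idea Jörn describes in the quoted thread, here is a minimal sketch in Python. It is not OpenNLP code: `classify` is a hypothetical stand-in for whatever single-label classifier you have trained (it takes a string and returns a category-to-probability mapping), `toy_classifier` is an invented keyword counter used only so the example runs, and splitting on blank lines is an assumption about how paragraphs are delimited.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical classifier signature: text -> {category: probability}.
# In practice this would wrap a trained single-label model.
Classifier = Callable[[str], Dict[str, float]]


def split_paragraphs(document: str) -> List[str]:
    """Split on blank lines (an assumption) and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def topic_shares(document: str, classify: Classifier) -> Dict[str, float]:
    """Label each paragraph with its single best category, then report
    the fraction of paragraphs assigned to each category."""
    paragraphs = split_paragraphs(document)
    if not paragraphs:
        return {}
    counts = Counter()
    for paragraph in paragraphs:
        probabilities = classify(paragraph)
        best = max(probabilities, key=probabilities.get)
        counts[best] += 1
    return {category: n / len(paragraphs) for category, n in counts.items()}


def toy_classifier(text: str) -> Dict[str, float]:
    """Invented keyword counter, for illustration only."""
    lowered = text.lower()
    scores = {
        "Computational Biology": 1.0 + lowered.count("algorithm"),
        "Cell Biology": 1.0 + lowered.count("cell"),
        "Molecular Biology": 1.0 + lowered.count("protein"),
    }
    total = sum(scores.values())
    return {category: score / total for category, score in scores.items()}


if __name__ == "__main__":
    doc = (
        "An algorithm for alignment.\n\n"
        "The cell membrane.\n\n"
        "Cell division in a cell."
    )
    print(topic_shares(doc, toy_classifier))
```

The result is only a rough share of paragraphs per category, which rests on the one-topic-per-paragraph assumption; it is not a substitute for a real topic model such as LDA.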
