Re: Document Classification

Lance Norskog Wed, 25 Apr 2012 18:38:17 -0700

Cool! Yeah, Tika has one also.

Now for the annoying use case: older web sites and pre-web text in
Southeast Asia and India/Pakistan are written in phonetic US-ASCII.
(They only had that technology available.) Does anybody do
classification on that kind of text?


On Tue, Apr 24, 2012 at 7:17 AM, Jason Baldridge
<[email protected]> wrote:
> Naive Bayes, perceptron variants (incl. passive-aggressive), faster training
> for maxent, and a better overall architecture. These are things my students
> and I are working on independently, and I will bring them into OpenNLP when
> time frees up to do so.
>
> On Tue, Apr 24, 2012 at 2:26 AM, Jörn Kottmann <[email protected]> wrote:
>
>> What are you planning to add?
>>
>> Jörn
>>
>>
>> On 04/24/2012 03:53 AM, Jason Baldridge wrote:
>>
>>> FWIW, there will be more classification capabilities coming in the next
>>> several months.
>>>
>>> -Jason
>>>
>>> On Mon, Apr 23, 2012 at 5:12 PM, Jörn Kottmann <[email protected]> wrote:
>>>
>>>> OpenNLP uses either a Maxent or a Perceptron classifier
>>>> to classify a piece of text. This can give you back the probabilities
>>>> for the various categories, but it's not designed to tell you how
>>>> much each topic is represented in your input document.
>>>>
>>>> You could take a document and assume each paragraph has one topic
>>>> and then classify it paragraph by paragraph.
>>>> We sadly don't have support for topic models, such as LDA.
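
A minimal sketch of the paragraph-by-paragraph idea above, in plain Java. It is not OpenNLP code: `ParagraphTopics`, `topicShares`, and the `Function<String, String>` classifier are hypothetical names, and the classifier stands in for whatever trained categorizer you use. It assumes paragraphs are separated by blank lines and that each paragraph gets exactly one label.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ParagraphTopics {

    /**
     * Splits a document on blank lines, labels each paragraph with the
     * supplied classifier, and returns the fraction of paragraphs per label.
     * The classifier is a hypothetical stand-in for a trained categorizer.
     */
    public static Map<String, Double> topicShares(String document,
                                                  Function<String, String> classifier) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (String paragraph : document.split("\\R\\s*\\R")) {
            String text = paragraph.trim();
            if (text.isEmpty()) {
                continue; // skip empty chunks from leading or trailing blank lines
            }
            counts.merge(classifier.apply(text), 1, Integer::sum);
            total++;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            shares.put(e.getKey(), e.getValue() / (double) total);
        }
        return shares;
    }

    public static void main(String[] args) {
        // Toy keyword rule standing in for a real model (assumption for illustration).
        Function<String, String> toy = p ->
                p.toLowerCase().contains("protein") ? "Molecular Biology" : "Computational Biology";
        String doc = "We align protein sequences.\n\n"
                + "The algorithm runs in linear time.\n\n"
                + "Protein folding is hard.";
        System.out.println(topicShares(doc, toy));
    }
}
```

The shares are only paragraph counts, so a long paragraph weighs the same as a short one. Weighting by paragraph length would be an easy variation if that matters for your articles.
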
>>>>
>>>> All the training logs are still written to the console. We have plans
>>>> to capture them properly and report training progress back via an
>>>> API. This output should then be logged and maybe just stored inside
>>>> the model for later debugging.
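
Until such an API exists, one possible workaround is to redirect stdout around the training call and forward the captured lines to your own logger. A rough sketch using only the JDK follows. `StdoutCapture` and `capture` are hypothetical names, the `Runnable` stands in for the actual training call, and `java.util.logging` is used here only to keep the example self-contained; the sink could presumably be a logger from the rest of your application instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.function.Consumer;
import java.util.logging.Logger;

public class StdoutCapture {

    /**
     * Runs the task with System.out temporarily redirected to a buffer,
     * then hands each captured non-empty line to the sink.
     * Not thread-safe: System.out is global, so output printed by other
     * threads during the task is captured as well.
     */
    public static String capture(Runnable task, Consumer<String> sink) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream redirected = new PrintStream(buffer, true);
        System.setOut(redirected);
        try {
            task.run();
        } finally {
            System.setOut(original); // always restore, even if the task throws
            redirected.flush();
        }
        String captured = buffer.toString();
        for (String line : captured.split("\\R")) {
            if (!line.isEmpty()) {
                sink.accept(line);
            }
        }
        return captured;
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("training");
        // The lambda is a placeholder for the real training call.
        capture(() -> System.out.println("placeholder training output"), log::info);
    }
}
```

Note that the lines only reach the logger after the task finishes, so this is not suitable if you want to watch progress live.
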
>>>>
>>>> Jörn
>>>>
>>>>
>>>> On 04/23/2012 07:41 PM, Alex Kudlick wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've just started using OpenNLP for a project to classify scientific
>>>>> articles into subjects. I have a few questions:
>>>>>
>>>>> 1. How do I configure logging for the model? I'm using slf4j-log4j for
>>>>> the rest of my application, but the training output from the model just
>>>>> goes to stdout.
>>>>>
>>>>> 2. Is there any support for classifying documents with multiple classes?
>>>>> For instance, a given article may be classified as Computational Biology,
>>>>> Cell Biology, and Molecular Biology.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex Kudlick
>>>>>
>>>>>
>>>>>
>>>
>>
>
>
> --
> Jason Baldridge
> Associate Professor, Department of Linguistics
> The University of Texas at Austin
> http://www.jasonbaldridge.com
> http://twitter.com/jasonbaldridge



-- 
Lance Norskog
[email protected]
