Re: text classification in portuguese

William Colen Fri, 20 Jan 2017 02:46:15 -0800

Hello Marcus,

You should take a look at Apache OpenNLP (http://opennlp.apache.org).
You can use it to pre-process your data, for example for sentence
detection and tokenization.


http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.sentdetect
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.tokenizer

You can find ready-to-use models for Portuguese here:
http://opennlp.sourceforge.net/models-1.5/

There is also a Snowball stemmer implementation
(opennlp.tools.stemmer.snowball.SnowballStemmer) that you can instantiate
like this:

new SnowballStemmer(SnowballStemmer.ALGORITHM.PORTUGUESE)

You could also try its Document Categorizer module. It is really easy to
use, even from the command line:

http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.doccat

You just need to create a training corpus, whose syntax is quite simple.
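
As a rough sketch of that corpus format: as far as I understand the Document
Categorizer manual, the trainer reads one document per line, with the
category label first, then whitespace, then the text. The helper name, the
category names and the sample texts below are made up for illustration only.

```python
def to_doccat_line(category, text):
    """Format one training sample as '<category> <text>' on a single line."""
    # The category label must not contain whitespace, so join its words with "_".
    label = "_".join(category.split())
    # Collapse newlines and tabs so the whole document stays on one line.
    body = " ".join(text.split())
    return f"{label} {body}"


# Hypothetical samples: (category, document text)
samples = [
    ("Financeiro", "Gestão de Caixa"),
    ("Recursos Humanos", "Folha de\npagamento"),
]

lines = [to_doccat_line(category, text) for category, text in samples]
print("\n".join(lines))
```

Writing these lines to a UTF-8 text file should give something the
DoccatTrainer tool described in the manual can read.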

There are a lot of other things in OpenNLP that can help you with text
classification. Take a look at the POS Tagger, Name Finder and Lemmatizer.

Regards,
William




---------- Forwarded message ----------
> From: Gustavo Frederico <[email protected]>
> Date: Thu, Jan 19, 2017 at 9:59 AM
> Subject: Re: text classification in portuguese
> To: [email protected]
>
>
> Marcus, at first sight this looks like correct JSON encoding. A JSON
> serializer may escape non-ASCII characters as \uXXXX sequences.
>
> Best regards,
> Gustavo
>
> On Thu, Jan 19, 2017 at 8:54 AM, Marcus Vinicius <[email protected]>
> wrote:
>
> > Hello guys,
> >
> > It's me again. I'm trying to classify Portuguese text following the demo
> > tutorial (http://predictionio.incubator.apache.org/demo/textclassification/).
> >
> > Has anyone already done this with PredictionIO? What would be the best
> > way to deal with stemming and Portuguese stop words?
> >
> > Let me take this opportunity to ask another question. Has anyone had
> > problems with encoding? My CSV load file is in ISO-8859, and in my Python
> > script I'm converting the text to UTF-8.
> >
> > text_utf8 = text.decode('iso-8859-1').encode('utf-8')
> >     client.create_event(
> >       event="documents",
> >       entity_type="source",
> >       entity_id=str(count), # use the count num as user ID
> >       properties= {
> >         "text" : text_utf8,
> >         "category" : attr[2],
> >         "label" : int(attr[3])
> >       }
> >     )
> >
> > When I retrieve the event from http://localhost:7070/events.json I get an
> > encoded word. Is that right?
> >
> > {"eventId":"x","event":"documents","entityType":"source","entityId":"73","properties":{"category":"A","text":"Gest\u008bo de Caixa","label":2},"eventTime":"2017-01-19T12:31:27.863Z","creationTime":"2017-01-19T12:31:27.867Z"}
> >
> >
> > I really appreciate your attention.
> >
> >
> > --
> >
> > Marcus Vinicius A. Silva
> >
> > *BEFORE PRINTING, think about your responsibility and commitment to the
> > ENVIRONMENT.*
> >
>
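
On the encoding question in the quoted message, here is a small Python 3
sketch of one possible reading. The escape \u008b is valid JSON, but U+008B is
a control character, not "ã" (which is U+00E3). That would be consistent with
the original byte being 0x8B and the file not actually being ISO-8859-1. Mac
Roman is only a guess on my part: it is one encoding in which 0x8B is "ã". The
byte string below is reconstructed from the retrieved event, not taken from
the real CSV.

```python
import json

# Bytes implied by the "\u008b" in the retrieved event (an assumption).
raw = b"Gest\x8bo de Caixa"

# Decoding as ISO-8859-1 maps byte 0x8B to the control character U+008B.
as_latin1 = raw.decode("iso-8859-1")

# Decoding as Mac Roman (a guess at the real source encoding) maps 0x8B to "ã".
as_mac_roman = raw.decode("mac_roman")

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
print(json.dumps(as_latin1))
print(json.dumps(as_mac_roman))
print(json.dumps(as_mac_roman, ensure_ascii=False))
```

If the second decoding gives the expected word, it may be worth checking how
the CSV was exported before loading the events again.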
