Hello Marcus,

You should take a look at Apache OpenNLP (http://opennlp.apache.org). You can
use it to pre-process your data, for example for sentence detection and
tokenization:
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.sentdetect
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.tokenizer

You can find ready-to-use models for Portuguese here:
http://opennlp.sourceforge.net/models-1.5/

There is also a Snowball stemmer implementation, which you can instantiate
like this:

    new SnowballStemmer(ALGORITHM.PORTUGUESE)

You could also give its Document Categorizer module a try. It is really easy
to use, even from the command line:
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.doccat

You just need to create a training corpus, whose syntax is quite simple.

There are many other things in OpenNLP that can help you with text
classification. Take a look at POS Tagging, Name Finder and Lemmatizer.

Regards,
William

---------- Forwarded message ----------
> From: Gustavo Frederico <[email protected]>
> Date: Thu, Jan 19, 2017 at 9:59 AM
> Subject: Re: text classification in portuguese
> To: [email protected]
>
> Marcus, at first sight this looks like a correct JSON encoding. JSON itself
> encodes the non-ASCII characters as \u escapes.
>
> Cheers,
> Gustavo
>
> On Thu, Jan 19, 2017 at 8:54 AM, Marcus Vinicius <[email protected]> wrote:
>
> > Hello guys,
> >
> > It's me again. I'm trying to classify Portuguese text following the demo
> > tutorial (http://predictionio.incubator.apache.org/demo/textclassification/).
> >
> > Has anyone already done this with PredictionIO? What would be the best
> > way to deal with stemming and Portuguese stop words?
> >
> > Allow me to take this opportunity to ask another question. Has anyone had
> > problems with encoding? My CSV load file is in ISO-8859-1, and in my
> > Python script I'm transforming the text to UTF-8:
> >
> >     text_utf8 = text.decode('iso-8859-1').encode('utf-8')
> >     client.create_event(
> >         event="documents",
> >         entity_type="source",
> >         entity_id=str(count),  # use the count number as the ID
> >         properties={
> >             "text": text_utf8,
> >             "category": attr[2],
> >             "label": int(attr[3])
> >         }
> >     )
> >
> > When I retrieve the event from http://localhost:7070/events.json I get an
> > encoded word. Is that right?
> >
> >     {"eventId":"x","event":"documents","entityType":"source","entityId":"73","properties":{"category":"A","text":"Gest\u008bo de Caixa","label":2},"eventTime":"2017-01-19T12:31:27.863Z","creationTime":"2017-01-19T12:31:27.867Z"}
> >
> > I really appreciate your attention.
> >
> > --
> > Marcus Vinicius A. Silva
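
A note on the encoded word in the JSON above: the sketch below uses only the
Python 3 standard library (the script in the thread is Python 2) to check
which character the event server actually returned. A JSON \uXXXX escape
names a Unicode code point, so \u008b is U+008B, a control character, while
"ã" is U+00E3. The escape can therefore be valid JSON even though the stored
character is not the intended one. The guess that the CSV was really saved as
Mac OS Roman, where the byte 0x8B is "ã", is an assumption and is not stated
anywhere in the thread; check the actual file before relying on it.

```python
import json

# The "text" property exactly as it appears in the event quoted in the thread.
raw = '{"text": "Gest\\u008bo de Caixa"}'
text = json.loads(raw)["text"]

# JSON \uXXXX escapes are Unicode code points, not UTF-8 bytes.
code_point = ord(text[4])    # the character between "Gest" and "o"
expected = ord("\u00e3")     # code point of the letter a-tilde

# Undo the ISO-8859-1 decode to get the original CSV bytes back.
# ISO-8859-1 maps every byte 0x00-0xFF to the same code point, so this is lossless.
original_bytes = text.encode("iso-8859-1")

# ASSUMPTION: the file was Mac OS Roman, not ISO-8859-1.
repaired = original_bytes.decode("mac_roman")

print(hex(code_point), hex(expected), repaired)
```

If the repaired string reads correctly, decoding the CSV with the real source
encoding before calling create_event should be enough; the JSON escaping
itself needs no change.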
