NER or TAGGER?

2015-08-16 Thread Damiano Porta
Hello everybody, I have just joined this mailing list! Thank you in advance for your help. I am studying a simple analyzer that extracts specific information from a text. The items I would like to extract are: 1. Person 2. Company 3. Email address 4. Zipcode 5. Home address for email addre

Re: NER or TAGGER?

2015-08-17 Thread Damiano Porta
recognize entities of the three types and then do a regular expression > like pattern matching. For example Name>(\\W+)(\\W+)(\\W+) e.t.c. > > > On Mon, Aug 17, 2015 at 2:55 AM, Damiano Porta > wrote: > > > Hello everybody, > > I have just joined this mailing l

Re: NER or TAGGER?

2015-08-17 Thread Damiano Porta
kup. > > You can also use the list to bootstrap the training data. [This is an > advanced way, just ignore if you don't understand] > > On Mon, Aug 17, 2015 at 5:22 PM, Damiano Porta > wrote: > > > Hello Vihari, thank you for your reply! > > > > Are you sure I s

Best way to find Zipcodes and Telephones

2015-08-21 Thread Damiano Porta
Hello, I am thinking about the best method to find zipcodes and telephones inside my text. Zipcodes must have 5 digits and I also have a Dictionary with a list of real zipcodes of my country. So the first question is: Do I have to train a NER model or use something like RegexNameFinder or Dictio
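The combination described in this question (a 5-digit pattern plus a dictionary of real zipcodes) can be sketched without any trained model. The following is only an illustration using plain `java.util.regex`, not the OpenNLP `RegexNameFinder` API; the class name `ZipcodeFinder`, the method `findValidZipcodes`, and the sample zipcodes are hypothetical. It covers zipcodes only, since the telephone format is not given in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class ZipcodeFinder {

    // Exactly five digits that are not part of a longer run of digits
    // (so a 10-digit telephone number does not yield a false zipcode).
    private static final Pattern FIVE_DIGITS =
            Pattern.compile("(?<!\\d)\\d{5}(?!\\d)");

    // Returns every 5-digit candidate that also appears in the dictionary.
    static List<String> findValidZipcodes(String text, Set<String> knownZipcodes) {
        List<String> found = new ArrayList<>();
        Matcher matcher = FIVE_DIGITS.matcher(text);
        while (matcher.find()) {
            String candidate = matcher.group();
            if (knownZipcodes.contains(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample dictionary; a real one would be loaded from the country's zipcode list.
        Set<String> known = new HashSet<>(Arrays.asList("00184", "20121"));
        String text = "Via Roma 1, 00184 Roma; order 12345; tel 0612345678";
        System.out.println(findValidZipcodes(text, known));
    }
}
```

The regex only proposes candidates; the dictionary lookup decides validity, so a 5-digit order number that is not a real zipcode is dropped.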

Re: Best way to find Zipcodes and Telephones

2015-08-22 Thread Damiano Porta
idity, I am sure you > can find a web service that provides this, depending on what country you’re > in. > > Cheers, > > Martin > > > > > On 21.08.2015 at 20:17, Damiano Porta wrote: > > > > Hello, > > I am thinking about the best method to find zip

What regex match? (RegexNameFinder)

2015-08-22 Thread Damiano Porta
Hello, I am using RegexNameFinder to extract specific patterns. I have a list of regexes and would like to know which regex produced each match. Is this possible? Thanks
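One general way to know which regex produced a match is to give every pattern a label and carry that label along with each hit. The sketch below shows the idea with plain `java.util.regex`; it is not the `RegexNameFinder` API, and the names `LabeledRegexMatcher`, `Hit`, and `findAll`, as well as the two sample patterns, are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class LabeledRegexMatcher {

    // One match, together with the label of the pattern that produced it.
    static final class Hit {
        final String label;
        final String text;
        final int start;
        final int end;

        Hit(String label, String text, int start, int end) {
            this.label = label;
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Runs every labeled pattern over the text, in the map's iteration order.
    static List<Hit> findAll(String text, Map<String, Pattern> patterns) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) {
                hits.add(new Hit(entry.getKey(), matcher.group(),
                        matcher.start(), matcher.end()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample patterns for illustration only.
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("zipcode", Pattern.compile("\\b\\d{5}\\b"));
        patterns.put("email", Pattern.compile("[\\w.]+@[\\w.]+"));

        for (Hit hit : findAll("Mail me at a@b.it, 00184 Roma", patterns)) {
            System.out.println(hit.label + " matched \"" + hit.text
                    + "\" at [" + hit.start + ", " + hit.end + ")");
        }
    }
}
```

A `LinkedHashMap` keeps the patterns in insertion order, so hits are grouped by pattern in a predictable sequence.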

Extract entities

2015-08-27 Thread Damiano Porta
Hello everybody, Let's suppose the following lines are sentences: - Name: Damiano - Surname: Porta - First name: Damiano - Last name: Porta - Name/Surname: Damiano Porta - Name: Damiano Porta - First name and Last name: Damiano Porta - The name is Damiano and the surname is Porta. etc etc I need

Grammars

2015-12-13 Thread Damiano Porta
Hello! Is there a grammar (pattern engine) like https://gate.ac.uk/sale/tao/splitch8.html#chap:jape for OpenNLP? Thank you!

TokensRegex

2015-12-28 Thread Damiano Porta
Hello, is there a tool like http://nlp.stanford.edu/software/tokensregex.shtml in OpenNLP? Thanks Damiano

Documents categorization

2016-09-24 Thread Damiano Porta
Hello, we need to categorize our documents into 80 sectors. These documents are resumes/CVs. We have many documents (more than 30k), but there is a problem. Should we try to extract the job positions inside each resume and categorize them or can we just add the entire document and categorize it in one

Deprecated NameFinderME.train

2016-10-24 Thread Damiano Porta
Hello, looking at the test code of NameFinderME I found the deprecated *train* method (the same is true of the official documentation). NameFinderME.train("en", "PERSON", sampleStream, TrainingParameters.defaultParams(), (byte[]) null, Collections.emptyMap()); that should be replaced with NameFinderME

Categorizer

2016-12-16 Thread Damiano Porta
Hello! Can I pass a list of custom feature generators into a doccat model via XML, as with NER models for example? Thanks Damiano

Re: Categorizer

2016-12-16 Thread Damiano Porta
you. > > HTH, > Jörn > > On Fri, Dec 16, 2016 at 2:34 PM, Damiano Porta > wrote: > > > Hello! > > can i use/pass a list of custom feature generators into a doccat model > via > > XML? > > Like NER models for example. > > > > Thanks > > Damiano > > >

Re: Categorizer

2016-12-19 Thread Damiano Porta
> > > no, sadly this is not possible, you will have to provide a custom > factory > > > class which wires everything up for you. > > > > > > HTH, > > > Jörn > > > > > > On Fri, Dec 16, 2016 at 2:34 PM, Damiano Porta > > > >

Re: OpenNLP on Twitter

2017-01-01 Thread Damiano Porta
Eugene +1 +1 +1 +1 +1 +1 ... On 01/Jan/2017 20:57, "Eugene Tenkaev" wrote: > And also need to be moved to GitHub with issue tracking there + Gitter for > communication with developers. Mailing list is too old, and hard to be used > > 2017-01-01 21:49 GMT+02:00 Rafik NACCACHE : > > > Gre

Re: OpenNLP on Twitter

2017-01-01 Thread Damiano Porta
Why not an official chat too? The mailing list is old. On 01/Jan/2017 21:10, "Joern Kottmann" wrote: > We are on Github: > https://github.com/apache/opennlp > > Jörn > > On Sun, 2017-01-01 at 21:56 +0200, Eugene Tenkaev wrote: > > And also need to be moved to GitHub with issue tracking there + >

Speed up training

2017-01-03 Thread Damiano Porta
Hello, I have a very large training set. Is there a way to speed up the training process? So far I have only changed the Xmx option inside bin/opennlp. Thanks Damiano

Re: Speed up training

2017-01-03 Thread Damiano Porta
I am training a new postagger and lemmatizer. 2017-01-03 19:24 GMT+01:00 Russ, Daniel (NIH/CIT) [E] : > Can you be a little more specific? What trainer are you using? > Thanks > Daniel > > On 1/3/17, 1:22 PM, "Damiano Porta" wrote: > > Hello, > I hav

Re: Speed up training

2017-01-03 Thread Damiano Porta
nnlp-tools/opennlp/tools/util/TrainingParameters.html#THREADS_PARAM > > William > > 2017-01-03 16:27 GMT-02:00 Damiano Porta : > > > I am training a new postagger and lemmatizer. > > > > 2017-01-03 19:24 GMT+01:00 Russ, Daniel (NIH/CIT) [E] < > dr...@mail.nih.gov

Re: Speed up training

2017-01-03 Thread Damiano Porta
Ok, I think the best value is the number of CPU cores, right? 2017-01-03 19:47 GMT+01:00 Russ, Daniel (NIH/CIT) [E] : > I do not believe the perceptron trainer is multithreaded. But it should > be fast. > > On 1/3/17, 1:44 PM, "Damiano Porta" wrote: >

Re: Speed up training

2017-01-03 Thread Damiano Porta
I always get Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded. I am using 5GB for Xmx with 1GB of training data... I will try adding 7GB for training. Could the number of threads help? 2017-01-03 19:57 GMT+01:00 Damiano Porta : > Ok, i think the

Re: Speed up training

2017-01-03 Thread Damiano Porta
your context generator. Maybe it is getting too many features. Try > to keep the strings small in the context generator. > > > 2017-01-03 17:02 GMT-02:00 Damiano Porta : > > > I always get Exception in thread "main" java.lang.OutOfMemoryError: GC > > overhead

Re: Speed up training

2017-01-03 Thread Damiano Porta
I am training the model in this way: opennlp POSTaggerTrainer -type maxent -model /home/damiano/it-pos-maxent-new.bin -lang it -data /home/damiano/postagger.train -encoding UTF-8 2017-01-03 21:01 GMT+01:00 Damiano Porta : > I am using the default postagger tool. > > I have many sente

Re: OpenNLP, telephone numbers, ticker symbols and URLs

2017-01-03 Thread Damiano Porta
Hello Chris, You do not need an extension. There is the RegexNameFinder that can match your entities as well here: https://github.com/apache/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/RegexNameFinder.java Damiano 2017-01-03 22:25 GMT+01:00 Christopher Hansen : > Hello

Proper way to extract name/surname from PERSON entity

2017-02-04 Thread Damiano Porta
Hello everybody, I have trained my NER (maxent) model and fortunately I have good PERSON accuracy. My problem is when I need to split/extract the name and the surname from the person entity. What approach can I follow for this step? I thought about a classifier that tells me the class of each word

Custom tokenizer?

2017-09-04 Thread Damiano Porta
Hello everybody, I have to build a custom tokenizer that has one more class, NOSPLIT. At the moment the current tokenizer supports the SPLIT class; I need to extend it because I have special codes/products that must stay in a single token (but unfortunately they contain whitespace). What approach shoul

Custom features for sentence detector

2018-02-07 Thread Damiano Porta
Hello, can we add custom features to the sentence detector? Thanks Damiano

Re: Custom features for sentence detector

2018-02-14 Thread Damiano Porta
Thank you! 2018-02-14 9:44 GMT+01:00 Aliaksandr Autayeu : > Yes, you can. See SentenceDetectorFactory.getSDContextGenerator() method. > And respectively SDContextGenerator interface and the default > implementation in DefaultSDContextGenerator. > > On 7 February 2018 at 12:17

Re: Opennlp NER with Gazetteer

2018-04-05 Thread Damiano Porta
Hi Sohini, take a look at *DictionaryNameFinder* (https://github.com/apache/opennlp/blob/master/opennlp-tools/src/main/java/opennlp/tools/namefind/DictionaryNameFinder.java) Damiano On 05/04/2018 17:57, Sohini Bagchi wrote: Hi, Has anyone used opennlp NER with gazetteer? If yes then

Should i escape new line?

2018-04-12 Thread Damiano Porta
Hello, I need new lines in my document. Should I escape them with a custom token like ? Thanks

Re: Should i escape new line?

2018-04-12 Thread Damiano Porta
Pardon, I did not explain "where"... I am talking about the training of a NER model 2018-04-12 12:05 GMT+02:00 Damiano Porta : > Hello, > i need new lines in my document. Should i escape it with a custom token > like ? > Thanks >

no roadmap?

2019-04-15 Thread Damiano Porta
Greetings to all! I have noticed, to my regret, that development of the library is going very slowly. I would like to understand whether you intend to continue development, perhaps integrating more advanced solutions like neural networks, or not. I have been using OpenNLP for a long time, but the advanc