Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]
nts so the telephone formats are many (separators of numbers too) . - / | \s 2017-03-02 18:38 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>: > Damiano, > > I am not an expert on the NameFinder, but I don’t think you want to > use a cu

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]
Thanks Damiano 2017-03-02 18:00 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>: > Hi Damiano, >In general this is a difficult problem (making n-grams from unigrams). > Have you considered using RegEx to find your dates/phone numbers

Re: Tokenizer for NER training

2017-03-02 Thread Russ, Daniel (NIH/CIT) [E]
Hi Damiano, In general this is a difficult problem (making n-grams from unigrams). Have you considered using RegEx to find your dates/phone numbers etc. and protecting them from the tokenizer (i.e. replacing the white space with a printable (though possibly not an alphanumeric character like
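
The protection idea in this message can be sketched roughly as follows. This is a minimal illustration in Python, not OpenNLP code: the two patterns, the placeholder character, and the whitespace split standing in for a real tokenizer are all assumptions made for the example.

```python
import re

# Hypothetical patterns; real dates and phone numbers come in many more formats.
DATE = r"\d{1,2} [A-Z][a-z]+ \d{4}"        # e.g. "2 March 2017"
PHONE = r"\+?\d{2,4}(?: \d{2,4}){2,3}"     # e.g. "+39 02 1234 5678"
PROTECTED = re.compile(f"{DATE}|{PHONE}")

# Assumption: this printable, non-whitespace character never occurs in the input.
PLACEHOLDER = "\u2423"


def protect(text):
    """Replace spaces inside matched dates/phone numbers with the placeholder."""
    return PROTECTED.sub(lambda m: m.group().replace(" ", PLACEHOLDER), text)


def tokenize(text):
    """Whitespace-tokenize (a stand-in for a real tokenizer), then restore spaces."""
    return [tok.replace(PLACEHOLDER, " ") for tok in protect(text).split()]


tokens = tokenize("Call +39 02 1234 5678 before 2 March 2017 please")
print(tokens)
```

Each date or phone number survives as a single token because the tokenizer never sees a space inside it; the placeholder is swapped back afterwards.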

Re: Help Required in Code

2017-02-08 Thread Russ, Daniel (NIH/CIT) [E]
I am not an expert on this part of the code, but I believe the idea is that there are multiple characters that can end a sentence (in English, think .!?). So it might be checking whether any of the characters in the text match any of the end-of-sentence characters. Daniel On 2/8/17, 10:01 AM,
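
The check described here could look something like the sketch below. It is an illustration in Python rather than the OpenNLP code under discussion; the character set and the function name are assumptions.

```python
# Assumed English end-of-sentence characters; other languages would need others.
EOS_CHARS = {".", "!", "?"}


def candidate_positions(text, eos_chars=EOS_CHARS):
    """Return the index of every character that could end a sentence."""
    return [i for i, ch in enumerate(text) if ch in eos_chars]


positions = candidate_positions("Is it done? Yes. Great!")
print(positions)
```

These positions are only candidates. A period inside an abbreviation or a number also matches, which is presumably why a trained model then decides which candidates are real sentence boundaries.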

Re: Name Finder trainer default settings

2017-02-07 Thread Russ, Daniel (NIH/CIT) [E]
PM, Russ, Daniel (NIH/CIT) [E] < dr...@mail.nih.gov> wrote: > Hi Jörn, > > > >I think the best entity recognition systems use CRF’s. At some point > we might want to consider adding them. As you know, ME classifiers suffer > from l

Re: Name Finder trainer default settings

2017-02-07 Thread Russ, Daniel (NIH/CIT) [E]
, "Damiano Porta" <damianopo...@gmail.com> wrote: I have good results with perceptron, but +1 for CRF 2017-02-07 15:42 GMT+01:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>: > Hi Jörn, > > > >I think the best entity recognit

Re: Name Finder trainer default settings

2017-02-07 Thread Russ, Daniel (NIH/CIT) [E]
Hi Jörn, I think the best entity recognition systems use CRFs. At some point we might want to consider adding them. As you know, ME classifiers suffer from the label bias problem (see Lafferty et al.) CRFs deal

Re: Internal working of Open NLP

2017-02-06 Thread Russ, Daniel (NIH/CIT) [E]
I would like to answer your questions in reverse order… 5. How Maximum entropy works ? see A Maximum Entropy approach to NLP Berger, Della Pietra, Della Pietra. In the journal Computational Linguistics 22:1 (just google it…) In a nutshell, if you have no information all outcomes are equally
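
The "no information" starting point can be checked numerically. The sketch below is a small Python illustration, not part of the original message; the two example distributions are made up. It compares the Shannon entropy of a uniform distribution over four outcomes with a skewed one.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # no information: every outcome equally likely
skewed = [0.7, 0.1, 0.1, 0.1]        # made-up distribution favouring one outcome

print(entropy(uniform), entropy(skewed))
```

The uniform distribution has the larger entropy, which is the sense in which assuming equal probabilities commits to the least when nothing is known. Training data then adds constraints that pull the model away from uniform.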

Re: [VOTE] Apache OpenNLP 1.7.2 Release Candidate

2017-02-03 Thread Russ, Daniel (NIH/CIT) [E]
+1 (non-binding) Have not run across problems with external code that uses OpenNLP On 2/3/17, 9:57 AM, "Rodrigo Agerri" wrote: +1 also pass tests On Fri, Feb 3, 2017 at 3:34 PM, Jeffrey Zemerick wrote: > +1 (non-binding)

Re: [VOTE] Apache OpenNLP 1.7.2 Release Candidate

2017-02-01 Thread Russ, Daniel (NIH/CIT) [E]
I’ll take a look at it. Daniel On 2/1/17, 7:02 AM, "Joern Kottmann" wrote: The GIS training is not printing any messages due to a bug. Let's cancel this vote and try to release again with that bug fixed. Also the Data Indexers printing can't be controlled witht

OpenNLP model for model 1.7.3+

2017-01-27 Thread Russ, Daniel (NIH/CIT) [E]
Hello, With the release of OpenNLP 1.7.3, the GISModel serialization will not be backwards compatible with the pre-1.7.3 format. I am particularly concerned with the models on SourceForge, because I still use them. They may not be the best models, but they work fairly well and are easily

Re: Thread-safe versions of some of the tools

2017-01-11 Thread Russ, Daniel (NIH/CIT) [E]
Hi, I am a little confused. Why do you want to share an instance of a SentenceDetectorME across threads? Are your documents very long single sentences? I don’t think there is enough work for the SentenceDetectorME to make up the cost of multithreading on 4 cores. Previously, I had

Re: merge TrainingParameters and PluggableParameters

2017-01-10 Thread Russ, Daniel (NIH/CIT) [E]
to also help us resolve OPENNLP-675. Jörn On Tue, Jan 10, 2017 at 3:53 PM, Russ, Daniel (NIH/CIT) [E] < dr...@mail.nih.gov> wrote: > Hi, > >The point of PluggableParameters was to move the > “get(Int/String/Boolean

merge TrainingParameters and PluggableParameters

2017-01-10 Thread Russ, Daniel (NIH/CIT) [E]
Hi, The point of PluggableParameters was to move the “get(Int/String/Boolean)Parameter” methods out of AbstractTrainer and into a parameters class. I would like to merge the functionality of PluggableParameters into TrainingParameters. Before I start, is there any reason why the GISTrainer and the

Trunk vs. Master

2017-01-09 Thread Russ, Daniel (NIH/CIT) [E]
Hello, I am a little confused by the fact we have both a trunk and a master branch. Which branch should be the baseline? Can we remove the other? Daniel Daniel Russ, Ph.D. Staff Scientist, Office of Intramural Research Center for Information Technology National Institutes of Health U.S.

pull update

2016-12-22 Thread Russ, Daniel (NIH/CIT) [E]
Hi, I made a few changes as suggested by Suneel (thanks Suneel). Do I have to close/re-open the pull request? Daniel Daniel Russ, Ph.D. Staff Scientist, Office of Intramural Research Center for Information Technology National Institutes of Health U.S. Department of Health and Human

Pull request

2016-12-21 Thread Russ, Daniel (NIH/CIT) [E]
OK, I created a repository on GitHub and attempted a pull request. Did anyone get it? The repository is: https://github.com/danielruss/openNLP Thanks Daniel

Re: Next release

2016-11-07 Thread Russ, Daniel (NIH/CIT) [E]
Also the lemmatizer has significantly changed. I vote 1.7 On 11/7/16, 12:59 PM, "Joern Kottmann" wrote: Hello all, since our last release it has been a while and we received quite a few changes which would be nice to get released. There are still

Re: new tool training

2016-10-31 Thread Russ, Daniel (NIH/CIT) [E]
On 10/29/16, 8:45 AM, "Joern Kottmann" <kottm...@gmail.com> wrote: On Fri, 2016-10-28 at 14:16 +0000, Russ, Daniel (NIH/CIT) [E] wrote: > Hi Jörn, > 1) I agree that the field values should be set in the init method for > the QNTrainer. Other minor cha

Re: new tool training

2016-10-27 Thread Russ, Daniel (NIH/CIT) [E]
() should be exposed anymore. A new method train(DataIndexer) that calls isValid and then doTrain(indexer) is probably a better idea. Is it important to calculate the hash of all events? Daniel On 10/27/16, 11:49 AM, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote: Com

Re: new tool training

2016-10-27 Thread Russ, Daniel (NIH/CIT) [E]
/27/16, 11:15 AM, "Joern Kottmann" <kottm...@gmail.com> wrote: On Thu, Oct 27, 2016 at 4:41 PM, Russ, Daniel (NIH/CIT) [E] < dr...@mail.nih.gov> wrote: > Hello, > > Background: >I am developing a tool that uses OpenNLP. I have a mod

ContextGenerator

2016-10-21 Thread Russ, Daniel (NIH/CIT) [E]
Hello, Can we please make ContextGenerator a generic type? I opened a JIRA issue (OPENNLP-870). It is a simple fix. I can try to send a git diff. Daniel Daniel Russ, Ph.D. Staff Scientist, Office of Intramural Research Center for Information Technology National Institutes of Health U.S.

Re: Is sentence detection process really needed?

2016-08-26 Thread Russ, Daniel (NIH/CIT) [E]
ill be tagged the same. The problem in this case is that I need to create a tagger model too... On 26 Aug 2016 20:14, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote: The POSTaggerME uses tokenized sentences. In your example, both cases hav

Re: Is sentence detection process really needed?

2016-08-26 Thread Russ, Daniel (NIH/CIT) [E]
ger for: "My name is Damiano. My surname is Porta" OR separate: My name is Damiano. My surname is Porta. I think the tags will be the same, no? On 26 Aug 2016 18:24, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote:

Re: Is sentence detection process really needed?

2016-08-26 Thread Russ, Daniel (NIH/CIT) [E]
their contexts. So now I need to separate the sentences to create a custom model. At this point I will not try with one per line CV. On 26 Aug 2016 15:10, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote: Hi Damiano, I am not sure that the

Re: Is sentence detection process really needed?

2016-08-26 Thread Russ, Daniel (NIH/CIT) [E]
information. No? Damiano On 25 Aug 2016 16:26, "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov> wrote: Hi Damiano, Everyone can feel free to correct my ignorance but I view the name finder as follows. I look at it as walking d

Re: Is sentence detection process really needed?

2016-08-25 Thread Russ, Daniel (NIH/CIT) [E]
Hi Damiano, Everyone can feel free to correct my ignorance but I view the name finder as follows. I look at it as walking down the sentence and classifying words as “NOT IN NAME” until I hit the start of a name, then it is “START NAME”, followed by “STILL IN NAME” until “NOT IN
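
The walk described in this message maps naturally onto a small decoder that turns per-token labels into name spans. The Python sketch below only illustrates that view; the label strings and the function name are hypothetical, and it is not the NameFinderME implementation.

```python
# Hypothetical label names mirroring the three states in the message.
OTHER, START, CONT = "NOT IN NAME", "START NAME", "STILL IN NAME"


def spans_from_tags(tags):
    """Walk the per-token tags and return (begin, end) token spans, end exclusive."""
    spans = []
    begin = None  # index where the currently open name started, if any
    for i, tag in enumerate(tags):
        if tag == START:
            if begin is not None:      # a new name starts right after another
                spans.append((begin, i))
            begin = i
        elif tag == CONT and begin is not None:
            continue                   # still inside the open name
        else:
            if begin is not None:      # the open name has just ended
                spans.append((begin, i))
            begin = None
    if begin is not None:              # a name that runs to the end of the sentence
        spans.append((begin, len(tags)))
    return spans


tokens = ["My", "name", "is", "Damiano", "Porta", "."]
tags = [OTHER, OTHER, OTHER, START, CONT, OTHER]
names = [" ".join(tokens[b:e]) for b, e in spans_from_tags(tags)]
print(names)
```

A "STILL IN NAME" label with no preceding start is treated as outside a name here. That is a design choice for the sketch, not a statement about how OpenNLP handles such sequences.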

ContextGenerator

2016-06-20 Thread Russ, Daniel (NIH/CIT) [E]
Hello, Is it possible to change the ContextGenerator interface to use generics? I would send in a JIRA request, but I am not sure how to do it. Create has a nice big red button, the red "create service desk request" is clear, but then something about kylin, Atlas, Ranger, and Apache

Re: Surronding tokens of the entity on MaxEnt models

2016-05-02 Thread Russ, Daniel (NIH/CIT) [E]
00 different patterns). How can I create those features with my patterns? Thank you in advance! 2016-05-02 15:19 GMT+02:00 Russ, Daniel (NIH/CIT) [E] <dr...@mail.nih.gov>: Hi Damiano, Why are you so sure that your model will not work? A couple of things

Re: Surronding tokens of the entity on MaxEnt models

2016-05-02 Thread Russ, Daniel (NIH/CIT) [E]
Hi Damiano, Why are you so sure that your model will not work? A couple of things to remember: 1. You need quite a bit of training data. Two sentences do not make a training set. 2. You probably need more than a window of words as your features. However, you can see that

Re: Question about OpenNLP and comparison to e.g., NTLK, Stanford NER, etc.

2015-11-12 Thread Russ, Daniel (NIH/CIT) [E]
Chris, Joern is correct. However, I would slightly disagree on a few minor points. 1) I use the old SourceForge models. I find that the sources of error in my analysis are usually not due to mistakes in sentence detection or POS tagging. I don’t have the annotated data or the time/money

Re: mallet addon

2015-10-20 Thread Russ, Daniel (NIH/CIT) [E]
from: https://opennlp.apache.org/mail-lists.html To un-subscribe send an e-mail to dev-unsubscr...@opennlp.apache.org Dan On Oct 20, 2015, at 10:43 AM, Eldad Yamin > wrote: How can I unsubscribe? On Sep

Re: OpenNLP 1.6.0 RC 3 ready for testing

2015-04-30 Thread Russ, Daniel (NIH/CIT) [E]
Any chance of getting my patch (OPENNLP-759) included in the next update? I know that the higher priority items get incorporated first. If someone has some time, it is a simple change. Dan On Apr 30, 2015, at 7:57 AM, William Colen wrote: Our third release candidate is ready for testing.