abbdict format

2017-04-12 Thread Benedict Holland
Hello All, I am getting into NLP for a project and this is the solution we are going to use. I noticed that in many places there is something called the abbdict flag but there is no specification for it. I believe it is an XML document. Could someone please provide a sample XML file and a brief

Re: abbdict format

2017-04-13 Thread Benedict Holland
ce. > https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html > > Regards, > William > > 2017-04-12 18:29 GMT-03:00 Benedict Holland > : > > > Hello All, > > > > I am getting into NLP for a project and this is the solution we are going > >

Saving models to a database?

2017-04-14 Thread Benedict Holland
Hello Everyone, I am wondering if there is a good tutorial for saving and loading models to/from a database. I have not found one yet but the documentation states that I can. Thanks, ~Ben

Re: Saving models to a database?

2017-04-14 Thread Benedict Holland
aving OpenNLP models to a database. Thanks, ~Ben On Fri, Apr 14, 2017 at 11:47 AM, Daniel Russ wrote: > Are you talking about a BaseModel (i.e., sentenceDetectorModel, POSModel…) > or a MaxentModel > > -Daniel > > > > On 4/14/17, 11:36 AM, "Benedict Holland"

Re: Saving models to a database?

2017-04-14 Thread Benedict Holland
Sure. It is actually throughout the document. All I had to do was search for "database" in https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html I pulled text above as a copy and paste. I think the solution I was looking for would be to take a model's database connection (input stea

Re: Saving models to a database?

2017-04-14 Thread Benedict Holland
el.html#method.summary > > Have you tried serializing to a ByteArrayOutputStream, getting the byte[] with > toByteArray(), and writing it to the database as a blob? > Daniel > > > > On Apr 14, 2017, at 1:13 PM, Benedict Holland < > benedict.m.holl...@gmail.com> wrote: > > >
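The blob approach Daniel suggests can be sketched with the standard library alone. This is only a sketch under stated assumptions, not OpenNLP's own API: the `StreamWritable` interface and the class and method names are hypothetical stand-ins for a model that can write itself to an `OutputStream`. With OpenNLP, the corresponding call would be the model's `serialize(OutputStream)` method, and the resulting byte[] could then be bound to a BLOB column with JDBC's `PreparedStatement.setBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class ModelBlobSketch {

    /** Hypothetical stand-in for anything that can write itself to a stream. */
    public interface StreamWritable {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Serializes the model into memory and returns the bytes to store as a blob. */
    public static byte[] toBlob(StreamWritable model) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            model.writeTo(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Wraps bytes read back from the database so a model constructor can consume them. */
    public static ByteArrayInputStream fromBlob(byte[] blob) {
        return new ByteArrayInputStream(blob);
    }

    public static void main(String[] args) {
        // Fake "model" that writes three bytes; a real model would write its own format.
        byte[] blob = toBlob(out -> out.write(new byte[] {1, 2, 3}));
        System.out.println("blob size: " + blob.length);
        System.out.println("first byte read back: " + fromBlob(blob).read());
    }
}
```

Loading would be the reverse: read the blob column into a byte[], wrap it with `fromBlob`, and hand the stream to the model class's `InputStream` constructor.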

UIMA GC Out of Memory exception

2017-04-17 Thread Benedict Holland
I create the pear file and everything compiles. I call java with -Xms1M -Xms1M where the last line in runUimaClass.bat is @"%UIMA_JAVA_CALL%" -Xms1M -Xms1M -DVNS_HOST=%VNS_HOST% -DVNS_PORT=%VNS_PORT% "-Duima.home=%UIMA_HOME%" "-Duima.datapath=%UIMA_DATAPATH%" "-Djava.util.lo

Re: UIMA GC Out of Memory exception

2017-04-18 Thread Benedict Holland
Hi Thilo, It should have been and I changed it and still receive an identical error. Thanks, ~Ben On Tue, Apr 18, 2017 at 8:06 AM, Thilo Goetz wrote: > The second -Xms should be -Xmx instead? > > > > On 18.04.17 00:06, Benedict Holland wrote: > >> I create the
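Thilo's point is that `-Xms` sets the initial heap size and `-Xmx` the maximum, so passing `-Xms` twice leaves the maximum at the JVM default. One way to confirm which limit the JVM actually picked up is to print it from inside the process. A minimal sketch; the class name `HeapCheck` is made up for illustration and is not part of UIMA.

```java
public class HeapCheck {

    /** Maximum heap the running JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Run with the same -Xms/-Xmx flags as the failing command and compare.
        System.out.println("Max heap (MiB): " + maxHeapMiB());
    }
}
```

If the number printed does not change when the `-Xmx` value changes, the flags are probably not reaching the JVM that runs the annotator.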

Re: UIMA GC Out of Memory exception

2017-04-18 Thread Benedict Holland
As an update, it appears that when I run the Eclipse UIMA CAS Visual Debugger tool with the pear file, it works. I will post this to the UIMA message group but I am curious if anyone has run into this before? Thanks, ~Ben On Tue, Apr 18, 2017 at 11:16 AM, Benedict Holland < benedict.m.h

Re: UIMA GC Out of Memory exception

2017-04-19 Thread Benedict Holland
mething along those lines... > > --Thilo > > > > On 18.04.17 18:47, Benedict Holland wrote: > >> As an update, it appears that when I run the Eclipse UIMA CAS Visual >> Debugger tool with the pear file, it works. I will post this to the UIMA >> message group but I am c

Re: Training data sets size for Word Tokenizer and Sentence Detector

2017-09-25 Thread Benedict Holland
Hello, I am almost certain that you will have to pay for data sources. There are a few that are very reasonable, such as the entire Wikipedia set (roughly 3 billion words) across many languages. I have not found a free one, particularly for names, and I would be very interested in that possibility

List of names training name finder

2017-10-03 Thread Benedict Holland
Hello all, We are attempting to develop a model with a list of names. We have a long and comprehensive list of names but very little text surrounding them. We have some text that we can tag, though not much. Is it possible to create a name finding model using a simple list like this or do we have

Re: List of names training name finder

2017-10-04 Thread Benedict Holland
eSpans[i] +" "+ names[i]); > } > > } > > } > > [0..1) default Daniel > [2..3) default Al > [4..5) default Bob > > > On Oct 3, 2017, at 2:27 PM, Benedict Holland < > benedict.m.holl...@gmail.com> wrote: > > > >

Name training data sentences

2017-10-06 Thread Benedict Holland
Hello all, I am working on getting together a file with a list of tokenized sentences. I have a quick question: Can name training data contain sentences without any tags? For example, if I had a sentence like Molly enjoys pancakes in the morning . She does not enjoy being woken up at 4:30 by

Re: Name training data sentences

2017-10-06 Thread Benedict Holland
n Russ wrote: > >> I believe it does. Every word is classified as “begin”, “inside”, or > “outside” - BIO encoding, so an event is generated for “she” and then > “does” and then “not” — all of which is classified as “outside”. > >> > >> Anyone smarter have a comme
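To illustrate the begin/inside/outside encoding described in the reply, here is a small sketch that turns one line of name finder training data into per-token labels. It assumes the `<START:type> ... <END>` markup from the OpenNLP name finder training format; the class name and the `B-`/`I-`/`O` label strings are illustrative only and are not the outcome names OpenNLP uses internally.

```java
import java.util.ArrayList;
import java.util.List;

public class BioSketch {

    /** Returns one label per token: B-type, I-type, or O for tokens outside any span. */
    public static List<String> bioLabels(String taggedLine) {
        List<String> labels = new ArrayList<>();
        String type = null;     // entity type of the currently open span, or null
        boolean first = false;  // true until the first token of the span is labeled
        for (String tok : taggedLine.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue;
            }
            if (tok.startsWith("<START:") && tok.endsWith(">")) {
                type = tok.substring("<START:".length(), tok.length() - 1);
                first = true;
            } else if (tok.equals("<END>")) {
                type = null;
            } else if (type == null) {
                labels.add("O");
            } else {
                labels.add((first ? "B-" : "I-") + type);
                first = false;
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(bioLabels("<START:person> Molly <END> enjoys pancakes in the morning ."));
        System.out.println(bioLabels("She does not enjoy it ."));
    }
}
```

A sentence with no tags still yields one label per token, all of them `O`, which matches the point that untagged sentences contribute "outside" events.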

Re: Dictionary Name Finder

2017-10-10 Thread Benedict Holland
Hi Manoj, Couldn't you just add the 2 token name out of the 3? If the order matters, always have the more specific first and go to less specific. What you are describing is a problem specifically associated with dictionary lookups: that unless there is an exact match, nothing will match. Dictionar
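The exact-match behaviour can be shown with a plain-Java lookup. This is a sketch of the general idea only, not OpenNLP's dictionary name finder: the class and method names are hypothetical, dictionary entries are whitespace-tokenized strings, and at each position the longest (most specific) matching entry wins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DictionaryLookupSketch {

    /** Returns "[start..end) name" for each exact match, preferring the longest entry. */
    public static List<String> find(String[] tokens, List<String> names) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int best = 0; // token length of the longest entry matching at position i
            for (String name : names) {
                String[] entry = name.trim().split("\\s+");
                int end = i + entry.length;
                if (entry.length > best && end <= tokens.length
                        && Arrays.equals(entry, Arrays.copyOfRange(tokens, i, end))) {
                    best = entry.length;
                }
            }
            if (best > 0) {
                hits.add("[" + i + ".." + (i + best) + ") "
                        + String.join(" ", Arrays.copyOfRange(tokens, i, i + best)));
                i += best; // skip past the matched name
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tokens = "John A. Smith met Bob".split(" ");
        // Only the two-token form is listed, so the three-token mention is missed.
        System.out.println(find(tokens, Arrays.asList("John Smith", "Bob")));
        // Listing both variants lets the more specific one match.
        System.out.println(find(tokens, Arrays.asList("John A. Smith", "John Smith", "Bob")));
    }
}
```

With only "John Smith" in the list, the mention "John A. Smith" is not found at all, which is why each variant of a name has to be added as its own entry.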

Re: Regarding Usage of Model

2017-12-16 Thread Benedict Holland
No. It isn't free. This is how linguists make money. That said, the data isn't expensive. I think the name training data is less than 1,000 dollars. It might be less for academic use. Thanks, ~Ben On Dec 16, 2017 2:37 PM, "Jeff Zemerick" wrote: > Unfortunately, I don't think that data is availa

Re: [ANNOUNCE] OpenNLP 1.8.4 released

2017-12-26 Thread Benedict Holland
I don't know if this is proper but CONGRATULATIONS! Thanks, ~Ben On Tue, Dec 26, 2017 at 9:18 AM, Jeff Zemerick wrote: > The Apache OpenNLP team is pleased to announce the release of version 1.8.4 > of Apache OpenNLP. The Apache OpenNLP library is a machine learning based > toolkit for the proc

Any english-lemmatizer.txt file?

2018-04-03 Thread Benedict Holland
Hello all, Does either the english-lemmatizer.txt or the en-lemmatizer.bin or en-lemmatizer.dict exist in the git tree? If not, do you have a good place I could get one? Thanks, ~Ben

Multiple document categories for MaxEnt model?

2018-04-12 Thread Benedict Holland
Hello all, I understand that maximum entropy models are excellent at categorizing documents. As it turns out, I have a situation where 1 document can be in many categories (1:m relationship). I believe that I could create training data that looks something like: category_1 category_2 ... If I

Re: Multiple document categories for MaxEnt model?

2018-04-12 Thread Benedict Holland
Have 1 model for each label: > > train_cat1.txt... > cat_1_TRUE > cat_1_FALSE > … > > train_cat2.txt… > cat_2_FALSE > cat_2_TRUE > > Hope it helps, Let me know what you wind up doing... > Daniel > > > On Apr 12, 2018, at 4:22 PM, Benedict Holland
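The one-model-per-label layout can be produced mechanically from multi-label data. A minimal sketch, assuming the document categorizer's one-document-per-line format with the category as the first token; the class name, method name, and sample texts are hypothetical, and the `_TRUE`/`_FALSE` suffixes follow the naming in the reply above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OneModelPerLabelSketch {

    /** Builds the binary training lines for one category from multi-label documents. */
    public static List<String> trainingLines(String category, Map<String, Set<String>> labelsByText) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Set<String>> doc : labelsByText.entrySet()) {
            boolean positive = doc.getValue().contains(category);
            lines.add(category + (positive ? "_TRUE " : "_FALSE ") + doc.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Document text mapped to every category it belongs to (1:m relationship).
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("first sample document", new HashSet<>(Arrays.asList("cat_1", "cat_2")));
        docs.put("second sample document", new HashSet<>(Arrays.asList("cat_2")));

        // One training file per category, each used to train its own binary model.
        for (String category : Arrays.asList("cat_1", "cat_2")) {
            System.out.println("train_" + category + ".txt");
            trainingLines(category, docs).forEach(System.out::println);
        }
    }
}
```

At prediction time each binary model is run on the document and the categories whose `_TRUE` outcome wins are kept, so a document can end up with several labels or none.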

Document Categorizer questions

2018-10-02 Thread Benedict Holland
Hello all, I have a few questions about the document categorizer that reading the manual didn't solve. 1. How many individual categories can I include in the training data? 2. Assume I have C categories. If I assume a document will have multiple categories *c*, should I develop C separate models

Re: Document Categorizer questions

2018-10-03 Thread Benedict Holland
any non-linear combinations of features for > the best set of features for classification (limited only by the features > you supply). Deep learning is kind of like modeling the features. > > Hope it helps > Daniel > > > > On Oct 2, 2018, at 1:28 PM, Benedict Holland < >

Re: Document Categorizer questions

2018-10-03 Thread Benedict Holland
the stoplight problem. However, Nikolai’s data may > have some property that works really well with NB. One thing to remember is > that the proof of the pudding is in the eating. > > Daniel > > > > On Oct 3, 2018, at 11:49 AM, Benedict Holland < > benedict.m.holl...

maxent produces very high probabilities for texts without overlap

2018-10-17 Thread Benedict Holland
Hello all, I can't quite figure out how the Doccat MaxEnt modeling works. Here is my setup: I have a set of training texts split into is_cat_1 and is_not_cat_1. I train my model using the default bag of words model. I have a document without any overlapping text with texts that are in is_cat_1. T

Re: maxent produces very high probabilities for texts without overlap

2018-10-17 Thread Benedict Holland
can’t tell the > two categories”. You probably don’t want to think of it as “You don’t look > like CAT_1 so you are NOT_CAT_1”. > Daniel > > > On Oct 17, 2018, at 1:14 PM, Benedict Holland < > benedict.m.holl...@gmail.com> wrote: > > > > Hello all, > >

Re: maxent produces very high probabilities for texts without overlap

2018-10-17 Thread Benedict Holland
ache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training > ) > > Is_cat_1 > Is_not_cat_1 > > Is that how you formatted your data? > Daniel > > > On Oct 17, 2018, at 3:50 PM, Benedict Holland < > benedict.m.holl...@gmail.com> wrote: > > > >

Re: maxent produces very high probabilities for texts without overlap

2018-10-18 Thread Benedict Holland
compare the results. It’s > late in the day on the US East coast, so I may not be able to get to it > until tomorrow. > Daniel > > > > On Oct 17, 2018, at 4:27 PM, Benedict Holland < > benedict.m.holl...@gmail.com> wrote: > > > > I mean... not really? I