Re: Custom features for sentence detector
Yes, you can. See the SentenceDetectorFactory.getSDContextGenerator() method, together with the SDContextGenerator interface and its default implementation, DefaultSDContextGenerator.

On 7 February 2018 at 12:17, Damiano Porta wrote:
> Hello,
> can we add custom features on the sentence detector?
> Thanks
> Damiano
Re: Default POS Tagger Dataset
Penn Treebank: https://www.cis.upenn.edu/~treebank/

On 16 September 2015 at 21:26, Nishant Kelkar wrote:
> Hi all,
>
> Just wanted to know: what is the data set used to train the default POS
> tagger en-pos-maxent.bin, and where can I download it?
>
> Thanks!
>
> Best Regards,
> Nishant Kelkar
Re: JWNL bug???
Most likely not. It looks like the first option refers to Penn Treebank tags (nouns - N-N, N-NS, etc.; verbs - V-B, V-BD, etc.; adjectives - J-J, J-JR, J-JS; adverbs - R-B, etc.) and the second option refers to WordNet nvar tags - n-oun, v-erb, a-djective, adve-r-b. It's a bit strange to see two types of tags together, but it does not seem too random :)

This line from main confirms the first guess:

    dict.getLemmas(word, "NN")

Aliaksandr

On 10 June 2015 at 16:43, Russ, Daniel (NIH/CIT) [E] dr...@mail.nih.gov wrote:
> Hi,
> I am not sure if this is a bug or not. In the getLemmas(String word, String tag) method of JWNLDictionary, if you are looking up adjectives it checks whether the tag starts with J or a. Does anyone know if this is a bug or deliberate? (JWNL-1.3.3 distributed with opennlp-1.5.3)
>
>     if (tag.startsWith("N") || tag.startsWith("n")) {
>       pos = POS.NOUN;
>     } else if (tag.startsWith("V") || tag.startsWith("v")) {
>       pos = POS.VERB;
>     } else if (tag.startsWith("J") || tag.startsWith("a")) {
>       pos = POS.ADJECTIVE;
>     } else if (tag.startsWith("R") || tag.startsWith("r")) {
>       pos = POS.ADVERB;
>     } else {
>       pos = POS.NOUN;
>     }
>
> Dan
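To make the explanation above concrete, here is a minimal standalone sketch of the same prefix check. Only the startsWith conditions come from the quoted message; the class name TagPrefixDemo and the enum WordNetPos are hypothetical stand-ins for JWNL's POS constants, used so the example compiles without JWNL on the classpath.

```java
/** Illustrative only: mirrors the prefix checks quoted from JWNLDictionary.getLemmas. */
final class TagPrefixDemo {

    /** Hypothetical stand-in for JWNL's POS constants. */
    enum WordNetPos { NOUN, VERB, ADJECTIVE, ADVERB }

    /** Maps a Penn Treebank tag (NN, VBD, JJ, RB, ...) or a WordNet tag (n, v, a, r) to a POS. */
    static WordNetPos toWordNetPos(String tag) {
        if (tag.startsWith("N") || tag.startsWith("n")) {
            return WordNetPos.NOUN;
        } else if (tag.startsWith("V") || tag.startsWith("v")) {
            return WordNetPos.VERB;
        } else if (tag.startsWith("J") || tag.startsWith("a")) {
            // Penn Treebank adjectives start with J; the WordNet adjective tag is "a".
            return WordNetPos.ADJECTIVE;
        } else if (tag.startsWith("R") || tag.startsWith("r")) {
            return WordNetPos.ADVERB;
        }
        // Same fallback as the quoted code: anything else is treated as a noun.
        return WordNetPos.NOUN;
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"NN", "VBD", "JJR", "a", "RB", "r", "DT"}) {
            System.out.println(tag + " -> " + toWordNetPos(tag));
        }
    }
}
```

Read this way, "J or a" is just the adjective row of two tag sets handled side by side, which matches the reading that the check is deliberate.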
Re: English lemmatizer using wordnet
Spanish POS + lemmatizer using this approach.

+1, it would be nice to have control over the dictionary; maybe we can come up with a format to store it in. That will allow us to easily include it in our models as a resource for feature generation, and it eliminates the dependency on external libraries.

That would be great! The format should then take morphological features into account. Of course, another method would be to re-implement John Carroll and colleagues' finite-state approach for English (and similar rule-based approaches for other languages), which removes the dependence on a dictionary. I will be exploring this further.

+1

We should define an interface which allows different implementations to be used, as we did for the other components.

+1. It seems that we have European languages represented here. Do we have anybody from the East? Chinese? It would be nice to check those too.
Re: Tagsets OpenNLP
If I'm not mistaken and understood you correctly, it's the Penn Treebank tagset: http://www.cis.upenn.edu/~treebank/

cheers,
Aliaksandr

2013/1/25 Javier SANCHEZ MONZON javier.sanchez-mon...@unister.de
> Hi there,
> I would like to know if there is a tagset list for the POS tagging task in OpenNLP.
> Thank you in advance.
> Greetings,
> Tino
Re: NER using perceptron instead of MaXent?
Jim, you might use the command-line tools' source code as a hint as well ;)

Aliaksandr

On Fri, Oct 5, 2012 at 5:25 PM, Jim foo.bar jimpil1...@gmail.com wrote:
> Hi William,
>
> First of all, thanks for the prompt reply. However, I am using the API, not the cmd tool... where do I pass that properties file?
>
> Jim
>
> On 05/10/12 16:20, William Colen wrote:
>> Yes, you can. Just create a properties file as follows:
>>
>> Threads=8
>> Iterations=100
>> Algorithm=PERCEPTRON
>> Cutoff=0
>>
>> and train, passing the properties file in the -params argument.
>>
>> Note that perceptron models benefit from cutoff 0. For Maxent models you usually set something like 5. I usually try different values with the CV tool to find the best cutoff value.
>>
>> Regards
>> William
>>
>> On Fri, Oct 5, 2012 at 11:59 AM, Jim foo.bar jimpil1...@gmail.com wrote:
>>> Hi all,
>>>
>>> is it at all possible to train a name-finder using perceptron instead of maxent? The documentation says that OpenNLP supports both, but I can only find examples for POS tagging and nothing else (sentence detection, NER, chunking, etc.)...
>>>
>>> thanks in advance...
>>>
>>> Jim
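For API users, the settings William lists are ordinary key=value pairs, so they can be read with the JDK before being handed to whatever training call you use. The sketch below only demonstrates that parsing step with java.util.Properties; the class name TrainingParamsDemo is hypothetical, and the thread itself does not say how the API consumes these values. As far as I know, OpenNLP offers a TrainingParameters class in opennlp.tools.util for this purpose, but check the Javadoc of your version rather than relying on this note.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: reads the training settings from the thread as plain Java properties. */
final class TrainingParamsDemo {

    /** Parses key=value lines such as "Algorithm=PERCEPTRON" into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader does not fail in practice; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The same four settings quoted in William's message.
        String text = String.join("\n",
                "Threads=8",
                "Iterations=100",
                "Algorithm=PERCEPTRON",
                "Cutoff=0");

        Properties props = parse(text);
        String algorithm = props.getProperty("Algorithm");
        int iterations = Integer.parseInt(props.getProperty("Iterations"));
        int cutoff = Integer.parseInt(props.getProperty("Cutoff"));

        System.out.println("algorithm=" + algorithm
                + ", iterations=" + iterations
                + ", cutoff=" + cutoff);
    }
}
```

In a real project the text would come from the same file you would otherwise pass to -params, which keeps command-line and API training runs on identical settings.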
Re: Anyone see issues with jwnl library hangs?
I had similar issues with JWNL, but that was a long time ago and I don't remember the details now. A small piece of code that reproduces the issue would help a lot in looking into it ;)

Aliaksandr

On Mon, Aug 6, 2012 at 10:39 PM, Jörn Kottmann kottm...@gmail.com wrote:
> Hello,
>
> never experienced that issue. It's blocking in RandomAccessFile.readLine; according to the javadoc it should not block forever, since one of the following conditions should usually be reached quickly:
>
> "This method blocks until a newline character is read, a carriage return and the byte following it are read (to see if it is a newline), the end of the file is reached, or an exception is thrown."
>
> http://docs.oracle.com/javase/1.4.2/docs/api/java/io/RandomAccessFile.html#readLine()
>
> Not sure what's going wrong there. Can you post some code to reproduce it? Maybe the call is reading in too much data.
>
> Jörn
>
> On 08/05/2012 03:40 AM, Chris Collins wrote:
>> I am building a classifier with OpenNLP and leveraging JWNLDictionary. In my experiments I am finding that after many invocations of the classifier it hangs in JWNL, specifically while doing a read. I am using WordNet 3.0 (the normal Princeton distro, not Stanford's). The thread dump is below (well, not all of it, but the OpenNLP + JWNL part). In the halt case it was trying to work with the word
>>
>> found_r_n_rnhttpsttlc_blablacompost_show_post_full_view_dw_w_b_ar_dxxcfazbay_ppid_rnrn
>>
>> Now clearly that isn't a word, so I should work on my tokenization :-}
>>
>> I tried this in a single thread, and tried both the default JWNL defined in the OpenNLP pom and 1.4 rc3.
>>
>> Any pointers would be helpful.
>>
>> Cheers
>> C
>>
>> java.lang.Thread.State: RUNNABLE
>>   at java.io.RandomAccessFile.read(RandomAccessFile.java:-1)
>>   at java.io.RandomAccessFile.readLine(RandomAccessFile.java:871)
>>   at net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile.readLine(PrincetonRandomAccessDictionaryFile.java:48)
>>   at net.didion.jwnl.dictionary.file_manager.FileManagerImpl.getIndexedLinePointer(FileManagerImpl.java:220)
>>   - locked 0x1071 (a net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile)
>>   at net.didion.jwnl.dictionary.FileBackedDictionary.getIndexWord(FileBackedDictionary.java:171)
>>   at net.didion.jwnl.dictionary.morph.LookupIndexWordOperation.execute(LookupIndexWordOperation.java:15)
>>   at net.didion.jwnl.dictionary.morph.AbstractDelegatingOperation.delegate(AbstractDelegatingOperation.java:47)
>>   at net.didion.jwnl.dictionary.morph.TokenizerOperation.tryAllCombinations(TokenizerOperation.java:131)
>>   at net.didion.jwnl.dictionary.morph.TokenizerOperation.tryAllCombinations(TokenizerOperation.java:102)
>>   at net.didion.jwnl.dictionary.morph.TokenizerOperation.execute(TokenizerOperation.java:75)
>>   at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor$LookupInfo.executeNextOperation(DefaultMorphologicalProcessor.java:172)
>>   at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor.lookupNextBaseForm(DefaultMorphologicalProcessor.java:125)
>>   at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor.lookupAllBaseForms(DefaultMorphologicalProcessor.java:142)
>>   at opennlp.tools.coref.mention.JWNLDictionary.getLemmas(JWNLDictionary.java:99)
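The thread does not establish the cause of the hang. One guess, based only on the trace passing through TokenizerOperation.tryAllCombinations and on the token containing many underscores, is that the lookup tries a very large number of combinations rather than blocking forever. Under that assumption, a cheap mitigation is to skip tokens that cannot plausibly be dictionary words before calling the lemma lookup. The sketch below is one way to do that; the class name LookupGuard and both thresholds are hypothetical and should be tuned against your own data.

```java
/** Illustrative only: filters out tokens that are unlikely to be dictionary words. */
final class LookupGuard {

    /** Arbitrary upper bound on token length; an assumption, not a JWNL setting. */
    static final int MAX_LENGTH = 40;

    /** Arbitrary upper bound on separators; "mother-in-law" has two. */
    static final int MAX_SEPARATORS = 3;

    /** Returns true if the token is short, alphanumeric and has few separators. */
    static boolean looksLikeWord(String token) {
        if (token == null || token.isEmpty() || token.length() > MAX_LENGTH) {
            return false;
        }
        int separators = 0;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '_' || c == '-' || c == ' ') {
                separators++;
            } else if (!Character.isLetterOrDigit(c)) {
                return false; // punctuation or control characters: not a plain word
            }
        }
        return separators <= MAX_SEPARATORS;
    }

    public static void main(String[] args) {
        String[] tokens = {
            "found",
            "mother-in-law",
            "found_r_n_rnhttpsttlc_blablacompost_show_post_full_view_dw_w_b_ar_dxxcfazbay_ppid_rnrn"
        };
        for (String token : tokens) {
            System.out.println(looksLikeWord(token) + "\t" + token);
        }
    }
}
```

A caller would check looksLikeWord(token) first and only then invoke the dictionary, falling back to the surface form otherwise. This works around the symptom and says nothing about whether JWNL itself has a defect.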
Re: Training a POS tagger model
Hi Alessandra,

> I would like to provide (train) a POS tagger model for the Italian language. I have some questions:
> - may I use a token_tag pair list in place of a sentence list? Something like: casa_NOUN e_CON (conjunction)

This way you lose context. There is a window (a few tokens around the target token) which is a feature for the POS tagger, and it is used in training. By formatting your dataset this way, you lose this feature.

> ... in place of la_ART casa_NOUN e_CON la_ART strada_NOUN ... because I have found an Italian word list.

Well, if it is a word list (arbitrary, unconnected words, e.g. like a dictionary), then it is not a text and it does not make a lot of sense to train a model on it. But from your example it looks like you have tagged sentences that are just formatted in a different way. So, you have two options:

1) Reformat your dataset into a format OpenNLP supports.
2) Write a Java class to support your format in OpenNLP.

Your format looks quite simple (do you have sentence delimiters?), so 1) might be feasible with something like awk or sed.

> - Do I need to provide a tag dictionary? Is there a default tag dictionary?

A tag dictionary improves the performance of the model, but it is optional. AFAIK, there is no default tag dictionary for Italian in OpenNLP.

Aliaksandr
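Option 1 can also be done in a few lines of Java instead of awk or sed. The sketch below assumes the input has one token_tag pair per line and a blank line between sentences; that delimiter is an assumption, since the thread never says how sentences are separated. It writes one sentence per line with the pairs separated by spaces, which is the word_tag layout shown in the "la_ART casa_NOUN ..." example. The class name TokenListToSentences is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: joins one-token-per-line input into one sentence per line. */
final class TokenListToSentences {

    /**
     * Groups token_tag lines into sentences.
     * Assumption: a blank line marks the end of a sentence.
     */
    static List<String> toSentenceLines(List<String> tokenLines) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : tokenLines) {
            String token = line.trim();
            if (token.isEmpty()) {
                // Sentence boundary: flush what has been collected so far.
                if (current.length() > 0) {
                    sentences.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(token);
            }
        }
        // Flush the last sentence if the input does not end with a blank line.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "la_ART", "casa_NOUN", "",
                "e_CON", "la_ART", "strada_NOUN");
        for (String sentence : toSentenceLines(input)) {
            System.out.println(sentence);
        }
    }
}
```

If the source data has no sentence delimiters at all, this conversion cannot recover them, and the context-window problem described above remains.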