Re: Custom features for sentence detector

2018-02-14 Thread Aliaksandr Autayeu
Yes, you can. See the SentenceDetectorFactory.getSDContextGenerator() method,
and correspondingly the SDContextGenerator interface and its default
implementation, DefaultSDContextGenerator.

On 7 February 2018 at 12:17, Damiano Porta wrote:

> Hello,
> can we add custom features on the sentence detector?
> Thanks
> Damiano
>


Re: Default POS Tagger Dataset

2015-09-17 Thread Aliaksandr Autayeu
Penn Treebank: https://www.cis.upenn.edu/~treebank/

On 16 September 2015 at 21:26, Nishant Kelkar wrote:

> Hi all,
>
> Just wanted to know: what is the data set used to train the default POS
> tagger en-pos-maxent.bin, and where can I download it?
>
> Thanks!
>
> Best Regards,
> Nishant Kelkar
>


Re: JWNL bug???

2015-06-10 Thread Aliaksandr Autayeu
Most likely not. It looks like the first option refers to Penn Treebank tags
(nouns - N-N, N-NS, etc., verbs - V-B, V-BD, etc., adjectives - J-J, J-JR,
J-JS, adverbs - R-B, etc.) and the second option refers to WordNet nvar
tags - n-oun, v-erb, a-djective, adve-r-b. It's a bit strange to see two
types of tags together, but it does not seem too random :) This line from
main confirms the first guess:

dict.getLemmas(word, "NN")
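
The prefix logic described above can be sketched in isolation. This is a minimal, self-contained illustration, not the JWNLDictionary source: the class name TagPrefixMapper and the nested Pos enum are hypothetical stand-ins for JWNL's own POS constants, and the fallback to NOUN simply mirrors the quoted snippet.

```java
/** Illustrative only: maps Penn Treebank or WordNet tags to a coarse POS. */
final class TagPrefixMapper {

    /** Hypothetical stand-in for JWNL's POS constants. */
    enum Pos { NOUN, VERB, ADJECTIVE, ADVERB }

    /**
     * Upper-case prefixes cover Penn Treebank tags (NN, VBD, JJ, RB, ...),
     * lower-case prefixes cover WordNet tags (n, v, a, r).
     * Anything else falls back to NOUN, as in the quoted code.
     */
    static Pos toPos(String tag) {
        if (tag.startsWith("N") || tag.startsWith("n")) {
            return Pos.NOUN;
        } else if (tag.startsWith("V") || tag.startsWith("v")) {
            return Pos.VERB;
        } else if (tag.startsWith("J") || tag.startsWith("a")) {
            return Pos.ADJECTIVE;
        } else if (tag.startsWith("R") || tag.startsWith("r")) {
            return Pos.ADVERB;
        }
        return Pos.NOUN;
    }

    public static void main(String[] args) {
        // Penn Treebank adjective tag and WordNet adjective tag map to the same POS.
        System.out.println(toPos("JJ"));
        System.out.println(toPos("a"));
    }
}
```

Seen this way, the J/a pair is the only branch where the two tagsets use different letters for the same part of speech, which is why it looks odd next to N/n, V/v and R/r.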

Aliaksandr

On 10 June 2015 at 16:43, Russ, Daniel (NIH/CIT) [E] dr...@mail.nih.gov
wrote:

 Hi,
 I am not sure if this is a bug or not.  In the getLemmas(String word,
 String tag) method of JWNLDictionary, if you are looking up adjectives it
 checks if the tag starts with J or a.  Does anyone know if this is a bug or
 deliberate?  (JWNL-1.3.3 distributed with opennlp-1.5.3)

   if (tag.startsWith("N") || tag.startsWith("n")) {
     pos = POS.NOUN;
   }
   else if (tag.startsWith("V") || tag.startsWith("v")) {
     pos = POS.VERB;
   }
   else if (tag.startsWith("J") || tag.startsWith("a")) {
     pos = POS.ADJECTIVE;
   }
   else if (tag.startsWith("R") || tag.startsWith("r")) {
     pos = POS.ADVERB;
   }
   else {
     pos = POS.NOUN;
   }

 Dan



Re: English lemmatizer using wordnet

2013-04-12 Thread Aliaksandr Autayeu
 Spanish pos + lemmatizer using this approach.


 +1, it would be nice to have control over the dictionary, maybe we can
 come up with a format to store it in. That will allow us to easily include
 it in our models as a resource for feature generation and eliminates the
 dependency on external libraries.

That would be great! The format should then take into account
morphological features.


 Of course, another method would be to re-implement John Carroll and
 colleagues' finite-state approach for English (and similar rule-based
 approaches for other languages) which removes the dependence on a
 dictionary. I will be exploring this further on.


 +1

 We should define an interface which allows using different
 implementations, as we did for the other components.

+1. It seems that we have European languages represented here. Do we have
anybody from the East? Chinese? It would be nice to check those too.


Re: Tagsets OpenNLP

2013-01-25 Thread Aliaksandr Autayeu
If I'm not mistaken and have understood you correctly, it's the Penn Treebank
tagset: http://www.cis.upenn.edu/~treebank/

cheers,
Aliaksandr

2013/1/25 Javier SANCHEZ MONZON javier.sanchez-mon...@unister.de

 Hi there,
 I would like to know if there is a tagset list for the POS tagging task in
 OpenNLP?
 Thank you in advance.
 Greetings,
 Tino


Re: NER using perceptron instead of MaXent?

2012-10-05 Thread Aliaksandr Autayeu
Jim, you might use the command-line tools' source code as a hint as well ;)
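
For the API route, the key/value pairs William lists below are an ordinary Java properties file, so they can be read with the JDK alone. The sketch below only parses them; the class name TrainParamsSketch is hypothetical. As far as I recall, opennlp.tools.util.TrainingParameters can be constructed from an InputStream over such a file and then handed to the trainer, but treat that as an assumption and check it against the OpenNLP version you use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: reads training parameters in properties-file syntax. */
final class TrainParamsSketch {

    /** Parses "key=value" lines, one per line, into a Properties object. */
    static Properties parse(String text) {
        Properties params = new Properties();
        try {
            params.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return params;
    }

    public static void main(String[] args) {
        // Same keys and values as in the quoted message.
        String text = "Threads=8\nIterations=100\nAlgorithm=PERCEPTRON\nCutoff=0\n";
        Properties params = parse(text);
        System.out.println("Algorithm: " + params.getProperty("Algorithm"));
        System.out.println("Cutoff: " + params.getProperty("Cutoff"));
    }
}
```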

Aliaksandr

On Fri, Oct 5, 2012 at 5:25 PM, Jim foo.bar jimpil1...@gmail.com wrote:

 Hi William,

 First of all thanks for the prompt reply, however I am using the API not
 the cmd tool...
 where do I pass that properties file?

 Jim



 On 05/10/12 16:20, William Colen wrote:

 Yes, you can. Just create a properties file as follows:

 Threads=8
 Iterations=100
 Algorithm=PERCEPTRON
 Cutoff=0

 and train, passing the properties file via the -params argument.

 Note that perceptron models benefit from cutoff 0. For Maxent models
 you usually set something like 5. I usually try different values with the
 CV tool to find the best cutoff value.

 Regards
 William

 On Fri, Oct 5, 2012 at 11:59 AM, Jim foo.bar jimpil1...@gmail.com
 wrote:

 Hi all,

 is it at all possible to train a name-finder using perceptron instead of
 maxent? The documentation says that OpenNLP supports both but I can only
 find examples for pos-tagging and nothing else (sentence detection, NER,
 chunking, etc.)...

 thanks in advance...

 Jim





Re: Anyone see issues with jwnl library hangs?

2012-08-07 Thread Aliaksandr Autayeu
I had similar issues with JWNL, but a long time ago; I don't remember the
details now. A small piece of code to reproduce the issue would help a lot in
looking into it ;)

Aliaksandr

On Mon, Aug 6, 2012 at 10:39 PM, Jörn Kottmann kottm...@gmail.com wrote:

 Hello,

 never experienced that issue.

 It's blocking in RandomAccessFile.readLine;
 according to the javadoc it should not block forever, since one of the
 following conditions should usually be reached quickly.

 This method blocks until a newline character is read, a carriage return
 and the byte following it are read (to see if it is a newline), the end of
 the file is reached, or an exception is thrown.

 http://docs.oracle.com/javase/1.4.2/docs/api/java/io/RandomAccessFile.html#readLine()

 Not sure what's going wrong there. Can you post some code to reproduce it?
 Maybe the call is reading in too much data.

 Jörn


 On 08/05/2012 03:40 AM, Chris Collins wrote:

 I am building a classifier with OpenNLP and leveraging JWNLDictionary.
 In my experiments I am finding that after many invocations of the classifier
 it hangs in jwnl, specifically doing a read.  I am using WordNet 3.0 (the
 normal Princeton distro, not Stanford's).



 The thread dump is below (well, not all of it, but the OpenNLP + jwnl part).
 In the halt case it was trying to work with the word
 found_r_n_rnhttpsttlc_blablacompost_show_post_full_view_dw_w_b_ar_dxxcfazbay_ppid_rnrn

 Now clearly that isn't a word so I should work on my tokenization :-}

 I tried this in a single thread and tried the default jwnl defined in the
 opennlp pom and also 1.4 rc3.

 Any pointers would be helpful.

 Cheers

 C


 java.lang.Thread.State: RUNNABLE
   at java.io.RandomAccessFile.read(RandomAccessFile.java:-1)
   at java.io.RandomAccessFile.readLine(RandomAccessFile.java:871)
   at net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile.readLine(PrincetonRandomAccessDictionaryFile.java:48)
   at net.didion.jwnl.dictionary.file_manager.FileManagerImpl.getIndexedLinePointer(FileManagerImpl.java:220)
   - locked 0x1071 (a net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile)
   at net.didion.jwnl.dictionary.FileBackedDictionary.getIndexWord(FileBackedDictionary.java:171)
   at net.didion.jwnl.dictionary.morph.LookupIndexWordOperation.execute(LookupIndexWordOperation.java:15)
   at net.didion.jwnl.dictionary.morph.AbstractDelegatingOperation.delegate(AbstractDelegatingOperation.java:47)
   at net.didion.jwnl.dictionary.morph.TokenizerOperation.tryAllCombinations(TokenizerOperation.java:131)
   at net.didion.jwnl.dictionary.morph.TokenizerOperation.tryAllCombinations(TokenizerOperation.java:102)
   at net.didion.jwnl.dictionary.morph.TokenizerOperation.execute(TokenizerOperation.java:75)
   at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor$LookupInfo.executeNextOperation(DefaultMorphologicalProcessor.java:172)
   at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor.lookupNextBaseForm(DefaultMorphologicalProcessor.java:125)
   at net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor.lookupAllBaseForms(DefaultMorphologicalProcessor.java:142)
   at opennlp.tools.coref.mention.JWNLDictionary.getLemmas(JWNLDictionary.java:99)





Re: Training a POS tagger model

2012-07-27 Thread Aliaksandr Autayeu
Hi Alessandra,

 I would like to provide (train) a POS tagger model for the Italian language.
 I have some questions:
  - may I use a token_tag pair list in place of a sentence list? Something
 like:
  casa_NOUN
  e_CON (conjunction)

This way you lose context. There is a window (a few tokens around the target
token) which is a feature for the POS tagger, and it is used in training. By
formatting your dataset this way, you lose this feature.
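
To make the point about the window concrete, here is a minimal sketch of window-style features. It is an illustration under stated assumptions, not OpenNLP's actual feature set: the class name ContextWindowSketch, the feature prefixes w=, p=, n= and the boundary markers <bos> and <eos> are all invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: builds features from a one-token window around position i. */
final class ContextWindowSketch {

    /** Returns the current token plus its left and right neighbours as features. */
    static List<String> features(String[] tokens, int i) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[i]);
        // Neighbours outside the sentence are replaced by boundary markers.
        features.add("p=" + (i > 0 ? tokens[i - 1] : "<bos>"));
        features.add("n=" + (i + 1 < tokens.length ? tokens[i + 1] : "<eos>"));
        return features;
    }

    public static void main(String[] args) {
        String[] sentence = {"la", "casa", "e", "la", "strada"};
        String[] isolated = {"casa"};
        // Within a sentence, "casa" sees its neighbours "la" and "e".
        System.out.println(features(sentence, 1));
        // As a one-token "sentence", it only sees the boundary markers.
        System.out.println(features(isolated, 0));
    }
}
```

With one token per training sample, every sample looks like the second call: the neighbour features carry no information, so the model can only learn from the word itself.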



  ...
  in place of
 
  la_ART casa_NOUN e_CON la_ART strada_NOUN
  ...
  because I have found an Italian word list.

Well, if it is a word list (arbitrary, unconnected words, e.g. like a
dictionary), then it is not a text and it does not make a lot of sense to
train a model on it. But from your example it looks like you have tagged
sentences, they are just formatted in a different way. So, you have two
options: 1) reformat your dataset into a format OpenNLP supports, or 2) write
a Java class to support your format in OpenNLP. Your format looks quite
simple (do you have sentence delimiters?), so 1) might be feasible with
something like awk or sed.
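
Option 1) can also be sketched in a few lines of Java. This is a minimal sketch that assumes the input has one token_TAG pair per line and a blank line between sentences; if your data has some other sentence delimiter, the blank-line check needs adapting. The class name PosFormatConverter is hypothetical. The output is one sentence per line with whitespace-separated token_TAG pairs, which is the layout shown in your second example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: joins one-pair-per-line input into one sentence per line. */
final class PosFormatConverter {

    /** Blank lines are treated as sentence boundaries (an assumption about the input). */
    static List<String> toSentenceLines(List<String> lines) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                // End of sentence: emit what has been collected, if anything.
                if (current.length() > 0) {
                    sentences.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(line);
            }
        }
        // Emit the last sentence when the input does not end with a blank line.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "la_ART", "casa_NOUN", "", "e_CON", "la_ART", "strada_NOUN");
        for (String sentence : toSentenceLines(input)) {
            System.out.println(sentence);
        }
    }
}
```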



  - Do I need to provide a tag dictionary? Is there a default tag
 dictionary?

A tag dictionary improves the performance of the model, but it is not
required; it is optional. AFAIK, for Italian there is no default tag
dictionary in OpenNLP.

Aliaksandr