Re: Model to detect the gender

2016-07-04 Thread Damiano Porta
Jorn, please, could you link me that model?

2016-07-04 14:42 GMT+02:00 Joern Kottmann :

> The co-referencer we used to have in opennlp-tools has a model to
> detect the gender of names. That could be extracted and put into a
> stand-alone component.
>
> Jörn
>
> On Mon, Jul 4, 2016 at 2:41 PM, Joern Kottmann  wrote:
>
> > I was speaking about the second case. We could build a dedicated
> > component specialized in extracting properties about already detected
> > entities.
> >
> > Jörn
> >
> > On Mon, Jul 4, 2016 at 2:33 PM, Damiano Porta 
> > wrote:
> >
> >> Hello Jorn,
> >> Do you mean that I need to "extend" my NER model to find other
> >> name-related entities too?
> >>
> >> OR
> >>
> >> Find the entities with a dictionary and then train a maxent model that
> >> finds other properties like person title, job position, etc.?
> >>
> >> Thanks for the clarification.

Re: Model to detect the gender

2016-07-04 Thread Joern Kottmann
Hello,

there are also other interesting properties, e.g. person title (such as
professor or doctor), job title/position, and
company legal form, and many more for other entity types.

Maybe it would be worth it to build a dedicated component to extract
properties from entities.

Jörn


Re: Model to detect the gender

2016-07-01 Thread Mondher Bouazizi
Hi,

Sorry for my late reply. I didn't fully understand your last email, but here
is what I meant:

Given a simple dictionary with the following columns:

Name    Type   Gender
Agatha  First  F
John    First  M
Smith   Both   B

where:
- "First" refers to a first name, "Last" (not in the example) refers to a last
name, and "Both" means the name can be either.
- "F" refers to female, "M" refers to male, and "B" refers to both genders.

and given the following two sentences:

1. "It was nice meeting you John. I hope we meet again soon."

2. "Yes, I met Mrs. Smith. I asked her her opinion about the case and felt
she knows something"

In the first example, a dictionary lookup shows that "John" is
a male name, so there is no need to go any further.
However, in the second example, the name "Smith", which is a family name in
our case, can fit both males and females. Therefore, we need to
extract features from the surrounding context and perform a classification
task.
Here are some features that I think would be interesting to use:

. F1: Presence of a male title (e.g. "Mr.") before the name.
  Values={True, False}
. F2: Presence of a female title (e.g. "Mrs.") before the name.
  Values={True, False}

. F3: Gender of the first personal pronoun (subject or object form) to the
  right of the name. Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F4: Distance between the name and the first personal pronoun to the right
  (in words). Values=NUMERIC
. F5: Gender of the second personal pronoun to the right of the name.
  Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F6: Distance between the name and the second personal pronoun to the right
  (in words). Values=NUMERIC
. F7: Gender of the third personal pronoun to the right of the name.
  Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F8: Distance between the name and the third personal pronoun to the right
  (in words). Values=NUMERIC

. F9: Gender of the first personal pronoun (subject or object form) to the
  left of the name. Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F10: Distance between the name and the first personal pronoun to the left
  (in words). Values=NUMERIC
. F11: Gender of the second personal pronoun to the left of the name.
  Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F12: Distance between the name and the second personal pronoun to the left
  (in words). Values=NUMERIC
. F13: Gender of the third personal pronoun to the left of the name.
  Values={MALE, FEMALE, UNCERTAIN, EMPTY}
. F14: Distance between the name and the third personal pronoun to the left
  (in words). Values=NUMERIC

In the second example, the features listed above (in order) take the
following values:

F1 = False
F2 = True
F3 = UNCERTAIN
F4 = 1
F5 = FEMALE
F6 = 3
F7 = FEMALE
F8 = 4
F9 = UNCERTAIN
F10 = 2
F11 = EMPTY
F12 = 0
F13 = EMPTY
F14 = 0
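
As a rough illustration, the title and right-side pronoun features (F1 to F8) could be computed as in the sketch below. This is plain Java with no OpenNLP calls. The class name, the pronoun and title lexicons, and the whitespace tokenizer are all assumptions made for the example, not part of any library. The left-side features (F9 to F14) would mirror the same loop; whether a title such as "Mrs." counts toward the distance is left open here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: title and right-side pronoun features around a name. */
final class PronounFeatureSketch {

    // Assumed lexicons; a real system would need fuller, language-specific lists.
    static final Set<String> MALE_TITLES = Set.of("mr", "sir");
    static final Set<String> FEMALE_TITLES = Set.of("mrs", "ms", "miss");
    static final Map<String, String> PRONOUN_GENDER = Map.ofEntries(
            Map.entry("he", "MALE"), Map.entry("him", "MALE"),
            Map.entry("she", "FEMALE"), Map.entry("her", "FEMALE"),
            Map.entry("i", "UNCERTAIN"), Map.entry("me", "UNCERTAIN"),
            Map.entry("you", "UNCERTAIN"), Map.entry("we", "UNCERTAIN"),
            Map.entry("us", "UNCERTAIN"), Map.entry("they", "UNCERTAIN"),
            Map.entry("them", "UNCERTAIN"));

    /** Splits on whitespace, strips surrounding punctuation, lower-cases. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.replaceAll("^\\W+|\\W+$", "").toLowerCase();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Returns the values of F1..F8 as strings, in that order. */
    static List<String> extract(String text, String name) {
        List<String> tokens = tokenize(text);
        int nameIndex = tokens.indexOf(name.toLowerCase());
        if (nameIndex < 0) {
            throw new IllegalArgumentException("name not found: " + name);
        }
        List<String> features = new ArrayList<>();

        // F1, F2: is the token right before the name a male / female title?
        String previous = nameIndex > 0 ? tokens.get(nameIndex - 1) : "";
        features.add(String.valueOf(MALE_TITLES.contains(previous)));
        features.add(String.valueOf(FEMALE_TITLES.contains(previous)));

        // F3..F8: gender and distance (in words) of the first three pronouns
        // to the right of the name.
        int found = 0;
        for (int i = nameIndex + 1; i < tokens.size() && found < 3; i++) {
            String gender = PRONOUN_GENDER.get(tokens.get(i));
            if (gender != null) {
                features.add(gender);
                features.add(String.valueOf(i - nameIndex));
                found++;
            }
        }
        // Missing pronouns are padded with EMPTY / 0, as in the example values.
        for (; found < 3; found++) {
            features.add("EMPTY");
            features.add("0");
        }
        return features;
    }

    public static void main(String[] args) {
        String sentence = "Yes, I met Mrs. Smith. I asked her her opinion "
                + "about the case and felt she knows something";
        // prints [false, true, UNCERTAIN, 1, FEMALE, 3, FEMALE, 4]
        System.out.println(extract(sentence, "Smith"));
    }
}
```

On the second example sentence this reproduces the F1 to F8 values listed above. Note that the lexicon treats "her" as a pronoun even in its possessive use, which matches how the example counts it.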

Of course, the choice of features depends on the type of data, and these
features might not work well for some texts, such as ones
collected from Twitter.

I hope this helps.

Best regards

Mondher


Re: Model to detect the gender

2016-06-30 Thread Damiano Porta
Hi Mondher,
could you give me a raw example so I can understand how I should train the
classifier model?

Thank you in advance!
Damiano



Re: Model to detect the gender

2016-06-29 Thread Mondher Bouazizi
Hi,

I would recommend a hybrid approach where, in a first step, you use a plain
dictionary and then perform the classification only if needed.

It's straightforward, but I think it would perform better than
a classification task alone.

In the first step you use a dictionary of names along with an attribute
specifying whether the name fits males, females or both. If the
name fits males or females exclusively, there is no need to go any further.

If the name fits both genders, or is a family name etc., a second step
is needed where you extract features from the context (surrounding words,
etc.) and perform a classification task using any machine learning
algorithm.

Another way would be to use the dictionary information itself (whether the
name fits males, females or both) as a feature when you perform the
classification.
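
A minimal sketch of the two-step idea, in plain Java with no OpenNLP calls. The class name, the tiny in-memory dictionary, and the toy "classifier" are all hypothetical stand-ins; any trained model that maps a sentence to a gender could take the classifier's place.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: dictionary lookup first, classifier only when needed. */
final class HybridGenderSketch {

    enum Gender { MALE, FEMALE, BOTH }

    // Assumed dictionary entries; a real one would be loaded from a file or database.
    static final Map<String, Gender> DICTIONARY = Map.of(
            "Agatha", Gender.FEMALE,
            "John", Gender.MALE,
            "Smith", Gender.BOTH);

    /**
     * Step 1: return the dictionary gender when it is unambiguous.
     * Step 2: otherwise (ambiguous or unknown name) ask the context classifier.
     */
    static Gender resolve(String name, String sentence,
                          Function<String, Gender> contextClassifier) {
        Gender fromDictionary = DICTIONARY.getOrDefault(name, Gender.BOTH);
        if (fromDictionary != Gender.BOTH) {
            return fromDictionary;
        }
        return contextClassifier.apply(sentence);
    }

    public static void main(String[] args) {
        // Toy stand-in for a trained model: looks for a female title or pronoun.
        Function<String, Gender> toyClassifier = s ->
                (s.contains("Mrs.") || s.contains(" she ") || s.contains(" her "))
                        ? Gender.FEMALE : Gender.MALE;

        // prints MALE (dictionary is enough)
        System.out.println(resolve("John", "It was nice meeting you John.", toyClassifier));
        // prints FEMALE (dictionary says BOTH, so the context decides)
        System.out.println(resolve("Smith", "Yes, I met Mrs. Smith. I asked her her opinion.", toyClassifier));
    }
}
```

Keeping the classifier behind a plain function makes it easy to swap the toy rule for a real model later without touching the dictionary step.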

Best regards,

Mondher

I am not sure


Re: Model to detect the gender

2016-06-29 Thread Damiano Porta
Awesome! Thank you so much William!

Re: Model to detect the gender

2016-06-29 Thread William Colen
To create a NER model, OpenNLP extracts features from the context, things
such as: word prefix and suffix, next word, previous word, previous word
prefix and suffix, next word prefix and suffix, etc.
When you don't configure the feature generator, it will apply the default:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen.api

Default feature generator:

AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[]{
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false)
    });


These default features should work for most cases (especially English), but
they can of course be extended. If you do so, your model will take the new
features into account. So yes, you are putting the features in your model.

Configuring custom features is not easy. I would start with the default,
use 10-fold cross-validation, and take notes of its effectiveness. Then
change/add a feature, evaluate, and take notes again. Sometimes a feature
that we are sure would help can destroy the model's effectiveness.
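
To make the evaluation loop concrete, here is a generic k-fold sketch in plain Java. It does not use OpenNLP's own cross-validation tooling; the class name and the trainAndScore callback are hypothetical, and the callback stands for "train a model on the training indices, then score it on the held-out indices".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical sketch of k-fold cross-validation over sample indices. */
final class KFoldSketch {

    /** Splits indices 0..n-1 into k folds by round-robin assignment. */
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    /**
     * Holds out each fold once, trains on the remaining folds, and returns
     * the mean of the k scores reported by trainAndScore(train, test).
     */
    static double crossValidate(int n, int k,
            BiFunction<List<Integer>, List<Integer>, Double> trainAndScore) {
        List<List<Integer>> folds = folds(n, k);
        double total = 0.0;
        for (int held = 0; held < k; held++) {
            List<Integer> train = new ArrayList<>();
            for (int f = 0; f < k; f++) {
                if (f != held) {
                    train.addAll(folds.get(f));
                }
            }
            total += trainAndScore.apply(train, folds.get(held));
        }
        return total / k;
    }

    public static void main(String[] args) {
        // Dummy scorer standing in for "train on trainIdx, evaluate on testIdx".
        double mean = crossValidate(100, 10,
                (trainIdx, testIdx) -> testIdx.size() / 100.0);
        System.out.println("mean score over 10 folds: " + mean);
    }
}
```

Running the same loop once with the default feature generator and once with each changed feature set gives comparable numbers to take notes on.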

Regards
William


2016-06-29 7:00 GMT-03:00 Damiano Porta :

> Thank you William! Really appreciated!
>
> I only do not get one point, when you said "You could increment your
> model using
> Custom Feature Generators" does it mean that i can "put" these features
> inside ONE *.bin* file (model) that implement different things, or, name
> finder is one thing and those feature generators other?
>
> Thank you in advance for the clarification.
>
> 2016-06-29 1:23 GMT+02:00 William Colen :
>
> > Not exactly. You would create a new NER model to replace yours.
> >
> > In this approach you would need a corpus like this:
> >
> > <START:personMale> Pierre Vinken <END> , 61 years old , will join the
> > board as a nonexecutive director Nov. 29 .
> > Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
> > Dutch publishing group . <START:personFemale> Jessie Robson <END> is
> > retiring , she was a board member for 5 years .
> >
> >
> > I am not an English native speaker, so I am not sure if the example is
> > clear enough. I tried to use Jessie as a neutral name and "she" as
> > disambiguation.
> >
> > With a corpus big enough maybe you could create a model that outputs both
> > classes, personMale and personFemale. To train a model you can follow
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
> >
> > Let's say your results are not good enough. You could increment your
> > model using Custom Feature Generators (
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
> > and
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
> > ).
> >
> > One of the implemented featuregen can take a dictionary (
> >
> >
> https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
> > ).
> > You can also implement other convenient FeatureGenerator, for instance
> > regex.
> >
> > Again, it is just a wild guess of how to implement it. I don't know if it
> > would perform well. I was only thinking how to implement a gender ML
> > model that uses the surrounding context.
> >
> > Hope I could clarify.
> >
> > William
> >
> > 2016-06-28 19:15 GMT-03:00 Damiano Porta :
> >
> > > Hi William,
> > > Ok, so you are talking about a kind of pipe where we execute:
> > >
> > > 1. NER (personM for example)
> > > 2. Regex (filter to reduce false positives)
> > > 3. Plain dictionary (filter as above) ?
> > >
> > > Yes we can split out model in two for M and F, it is not a big problem,
> > we
> > > have a database grouped by gender.
> > >
> > > I only have a doubt regarding the use of a dictionary. Because if we
> use
> > a
> > > dictionary to create the model, we could only use it to detect names
> > > without using NER. No?
> > >
> > >
> > >
> > > 2016-06-29 0:10 GMT+02:00 William Colen :
> > >
> > > > Do you plan to use the surrounding context? If yes, maybe you could
> try
> > > to
> > > > split NER in two categories: PersonM and PersonF. Just an idea, never
> > > read
> > > > or tried anything like it. You would need a training corpus with
> these
> > > > classes.
> > > >
> > > > You could add both the plain dictionary and the regex as NER features
> > as
> > > > well and check how it improves.
> > > >
> > > > 2016-06-28 18:56 GMT-03:00 Damiano Porta :
> > > >
> > > > > Hello everybody,
> > > > >
> > > > > we built a NER model to 

Re: Model to detect the gender

2016-06-29 Thread Damiano Porta
Thank you William! Really appreciated!

There is only one point I do not get. When you said "You could increment
your model using Custom Feature Generators", does it mean that I can "put"
these features inside ONE *.bin* file (model) that implements different
things, or is the name finder one thing and the feature generators another?

Thank you in advance for the clarification.



Re: Model to detect the gender

2016-06-28 Thread William Colen
Not exactly. You would create a new NER model to replace yours.

In this approach you would need a corpus like this:

<START:personMale> Pierre Vinken <END> , 61 years old , will join the board
as a nonexecutive director Nov. 29 .
Mr . <START:personMale> Vinken <END> is chairman of Elsevier N.V. , the
Dutch publishing group . <START:personFemale> Jessie Robson <END> is
retiring , she was a board member for 5 years .


I am not a native English speaker, so I am not sure if the example is
clear enough. I tried to use Jessie as a gender-neutral name and "she" as
the disambiguating context.

With a corpus big enough maybe you could create a model that outputs both
classes, personMale and personFemale. To train a model you can follow
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training
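
As a rough illustration of the corpus format (plain Python, not OpenNLP
code), the sketch below reads one sentence annotated with <START:type> and
<END> markers, the convention described in the training documentation linked
above, and returns the tokens plus the typed spans. The helper name
parse_annotated and the sample sentence are assumptions made up for this
example.

```python
import re

START_TAG = re.compile(r"<START:(\w+)>")


def parse_annotated(sentence):
    """Split a whitespace-tokenized, annotated sentence into tokens and spans.

    Each span is (start_index, end_index_exclusive, label), with indices
    counted over the returned tokens (markers are not tokens).
    """
    tokens, spans = [], []
    start, label = None, None
    for tok in sentence.split():
        match = START_TAG.fullmatch(tok)
        if match:
            # Remember where the entity begins and which class it has.
            start, label = len(tokens), match.group(1)
        elif tok == "<END>":
            if start is None:
                raise ValueError("<END> without a matching <START:...>")
            spans.append((start, len(tokens), label))
            start, label = None, None
        else:
            tokens.append(tok)
    return tokens, spans


sample = ("<START:personFemale> Jessie Robson <END> is retiring , "
          "she was a board member for 5 years .")
tokens, spans = parse_annotated(sample)
```

Here the span covers the two tokens "Jessie" and "Robson", and the pronoun
"she" stays in the token list as context a model could learn from.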

Let's say your results are not good enough. You could increment your model
using Custom Feature Generators (
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
and
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/package-summary.html
).

One of the implemented feature generators can take a dictionary (
https://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/util/featuregen/DictionaryFeatureGenerator.html
).
You can also implement other convenient feature generators, for instance
regex.
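
To make the idea concrete, here is a minimal Python sketch of what
dictionary and regex features for one token could look like. It is not the
OpenNLP feature generator API. The function name token_features, the toy
dictionary, the "ends in a" regex and the feature strings are all
assumptions chosen for illustration.

```python
import re

# Toy gender dictionary (hypothetical data, lower-cased first names).
GENDER_DICT = {"agatha": "F", "john": "M"}

# Hypothetical regex cue: the token ends in the letter "a".
ENDS_IN_A = re.compile(r".*a", re.IGNORECASE)


def token_features(tokens, index):
    """Return string features for tokens[index], including its neighbours."""
    tok = tokens[index]
    feats = ["w=" + tok.lower()]

    # Dictionary feature: only emitted when the token is a known name.
    gender = GENDER_DICT.get(tok.lower())
    if gender is not None:
        feats.append("dict=" + gender)

    # Regex feature: a weak hint, not a rule.
    if ENDS_IN_A.fullmatch(tok):
        feats.append("re=ends_in_a")

    # Surrounding context: previous and next token, when they exist.
    if index > 0:
        feats.append("prev=" + tokens[index - 1].lower())
    if index + 1 < len(tokens):
        feats.append("next=" + tokens[index + 1].lower())
    return feats


john_feats = token_features(["Mr", ".", "John", "is"], 2)
agatha_feats = token_features(["Agatha", "said"], 0)
```

The point of the design is that the dictionary and the regex become evidence
the classifier weighs together with the context, instead of hard filters
applied after NER.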

Again, this is just a wild guess at how to implement it. I don't know if it
would perform well. I was only thinking about how to implement a gender ML
model that uses the surrounding context.

Hope I could clarify.

William



Re: Model to detect the gender

2016-06-28 Thread Damiano Porta
Hi William,
Ok, so you are talking about a kind of pipeline where we execute:

1. NER (personM for example)
2. Regex (filter to reduce false positives)
3. Plain dictionary (filter as above) ?

Yes, we can split our model in two for M and F. It is not a big problem,
since we have a database grouped by gender.

I only have a doubt regarding the use of a dictionary. If we use a
dictionary to create the model, couldn't we just use the dictionary to
detect names, without using NER?





Re: Model to detect the gender

2016-06-28 Thread William Colen
Do you plan to use the surrounding context? If yes, maybe you could try to
split NER into two categories: PersonM and PersonF. Just an idea, I have
never read about or tried anything like it. You would need a training corpus
with these
classes.
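
A small Python sketch of what "two categories" means at the token level,
assuming a start/continue/other style encoding of the labels. The function
name encode_outcomes and the outcome strings are illustrative assumptions,
not taken from any particular library.

```python
def encode_outcomes(tokens, spans):
    """Turn typed spans into one outcome string per token.

    spans holds (start_index, end_index_exclusive, label) tuples.
    """
    outcomes = ["other"] * len(tokens)
    for start, end, label in spans:
        outcomes[start] = label + "-start"
        for i in range(start + 1, end):
            outcomes[i] = label + "-cont"
    return outcomes


tokens = ["Jessie", "Robson", "is", "retiring", ",", "she", "was", "here"]
spans = [(0, 2, "PersonF")]
outcomes = encode_outcomes(tokens, spans)
```

With such an encoding the classifier has to choose between PersonM and
PersonF outcomes for each name token, so context words such as "she" or
"Mr" can influence the decision.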

You could also add both the plain dictionary and the regex as NER features
and check whether the results improve.

2016-06-28 18:56 GMT-03:00 Damiano Porta :

> Hello everybody,
>
> we built a NER model to find persons (name) inside our documents.
> We are looking for the best approach to understand if the name is
> male/female.
>
> Possible solutions:
> - Plain dictionary?
> - Regex to check the initial and/letters of the name?
> - Classifier? (naive bayes? Maxent?)
>
> Thanks
>