Re: Training NameFinder with large corpus

Jeffrey Zemerick Mon, 07 Oct 2013 06:43:21 -0700

Gao,

I have a file of about 950 MB, created by Hadoop, with sentences in the format
described in the NameFinder training documentation (
http://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind.training.tool).
I'm running the jar as described on that page, with the number of
iterations set to 50. (I read somewhere that this was a suggested value.) After
the first failed attempt I increased the memory to 4096, but it failed again
(it just took longer to fail). I can increase the memory further, but I wanted
to see if there was anything I was missing.


Thanks,
Jeff



On Mon, Oct 7, 2013 at 9:29 AM, melo <[email protected]> wrote:

> Jeff,
>
> Would you please tell us exactly what method you are using?
>
> Are you calling the .jar file, or are you writing a new class to use the
> model?
>
> Honestly speaking, I don't think you should get involved with Hadoop.
> It is meant to handle tremendously more data than your 1 GB.
> By tremendous, I mean terabytes, maybe petabytes.
>
> There is always a way.
> Learning Hadoop is not so hard, but why bother?
>
> Gao
>
> On 2013/10/07, at 22:21, Mark G <[email protected]> wrote:
>
> > Also, Map Reduce will allow you to write the annotated sentences to HDFS as
> > part files, but at some point those files will have to be merged and the
> > model created from them. In Map Reduce you may find that all your part
> > files end up on the same reducer node and you end up with the same problem
> > on a random data node.
> > Seems like this would only work if you could append one MODEL with another
> > without recalculation.
> >
> >
> > On Mon, Oct 7, 2013 at 8:23 AM, Jörn Kottmann <[email protected]> wrote:
> >
> >> On 10/07/2013 02:05 PM, Jeffrey Zemerick wrote:
> >>
> >>> Thanks. I used MapReduce to build the training input. I didn't realize that
> >>> the training can also be performed on Hadoop. Can I simply combine the
> >>> generated models at the completion of the job?
> >>>
> >>
> >> That will not be an out-of-the-box experience; you need to modify OpenNLP
> >> to write the training events to a file and then use a trainer which can
> >> run on Hadoop, e.g. Mahout. We now almost have support to integrate
> >> 3rd party ML libraries into OpenNLP.
> >>
> >> Jörn
> >>
>
>
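
The merge step Mark mentions (combining the Hadoop part files into one
training file before the model is built) could be sketched roughly as below,
assuming the part files have already been copied from HDFS to a local
directory. The function name merge_part_files, the directory layout, and the
output file name are hypothetical and are not taken from the thread. The
sketch only performs the merge; it does not reduce the memory needed for
training.

```python
import glob
import os


def merge_part_files(part_dir, out_path, chunk_size=1 << 20):
    """Concatenate part-* files (sorted by name) into one training file.

    Hypothetical helper: assumes the part files are local and that each
    line holds one annotated sentence.
    """
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_path, "wb") as out:
        for part in parts:
            last = b"\n"  # treat an empty part file as already terminated
            with open(part, "rb") as src:
                # Copy in fixed-size chunks so memory use stays constant
                # regardless of how large each part file is.
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the last sentence of one part does not run into
            # the first sentence of the next.
            if last != b"\n":
                out.write(b"\n")
    return len(parts)
```

For example, merge_part_files("job-output", "train.txt") would write every
job-output/part-* file into train.txt, in name order, and return the number
of part files it merged.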
