Re: Nutch + GATE on Amazon EMR

chethan Sun, 04 May 2014 23:45:23 -0700

Thanks, Feng and Julien, for your replies. I will take a look at both options
and post an update on what worked.


Regards,

--
Chethan Prasad


On Mon, May 5, 2014 at 12:10 AM, Julien Nioche <
[email protected]> wrote:

> Chethan
>
> Have a look at Behemoth [https://github.com/DigitalPebble/behemoth] if you
> haven't already done so. Porting the code from the GATE module into an
> IndexingFilter should not be too difficult. What we do there is that the
> GATE pipeline is stored on HDFS and loaded by the slaves via the
> distributed cache.
>
> Alternatively, you could use Nutch just for crawling, then use the Nutch
> and GATE modules of Behemoth, as well as the SOLR or ElasticSearch ones if
> that's what you want to do.
>
> HTH
>
> Julien
>
>
> On 4 May 2014 06:52, chethan <[email protected]> wrote:
>
> > I have set up Nutch to crawl on Amazon EMR, and I have a plugin that uses
> > GATE <https://gate.ac.uk/> for text processing in the Indexing filters.
> > GATE requires certain static resources (some XML and text files) to be
> > loaded before it can be initialized.
> > I tried to bundle these resources in the job jar and load them from the
> > classpath but that didn't work. I also tried copying them to HDFS and
> > loading them from there but that too failed.
> >
> > What is the best way to bundle such static resources and reference them
> > in the Indexing filters? I am working on copying the file to the distributed
> > cache and loading it from there but I wanted to know how others are
> > handling this. Thanks.
> >
> > Regards,
> >
> > --
> > Chethan Prasad
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
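
For readers following the distributed-cache suggestion quoted above, here is a
minimal sketch of one piece of it: once a zipped GATE application has been
unpacked on a task node, the filter still has to find the directory that holds
the pipeline definition. The class name GateResourceLocator, the method
findPipelineDir, and the marker file name "application.xgapp" are assumptions
made for illustration; none of them come from this thread. The sketch uses only
the JDK and takes the local cache directories as a plain list, so the Hadoop
side (registering the archive and asking for its local paths) is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

/**
 * Hypothetical helper, not part of Nutch, GATE, or Behemoth: given the local
 * directories where cached archives were unpacked on a task node, find the
 * directory that contains the GATE application file.
 */
public class GateResourceLocator {

    /**
     * Returns the first directory, searched up to maxDepth levels below each
     * root, that contains a regular file named markerFile.
     */
    public static Optional<Path> findPipelineDir(List<Path> localCacheDirs,
                                                 String markerFile,
                                                 int maxDepth) throws IOException {
        for (Path root : localCacheDirs) {
            if (!Files.isDirectory(root)) {
                continue; // skip plain files and paths that do not exist
            }
            try (Stream<Path> walk = Files.walk(root, maxDepth)) {
                Optional<Path> hit = walk
                        .filter(p -> Files.isRegularFile(p)
                                && p.getFileName().toString().equals(markerFile))
                        .map(Path::getParent)
                        .findFirst();
                if (hit.isPresent()) {
                    return hit;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Simulate an unpacked archive: <tmp>/gate-app/application.xgapp
        Path cacheRoot = Files.createTempDirectory("cache");
        Path appDir = Files.createDirectories(cacheRoot.resolve("gate-app"));
        Files.createFile(appDir.resolve("application.xgapp"));

        Optional<Path> found = findPipelineDir(
                Collections.singletonList(cacheRoot), "application.xgapp", 3);
        System.out.println(found.isPresent() ? "found: " + found.get() : "not found");
    }
}
```

In an Indexing filter, the returned directory would be what gets handed to the
GATE initialization code, done once per task rather than once per document.
Whether this fits depends on how the archive is laid out and on the Hadoop
version running on EMR, so treat it as a starting point only.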
