Chethan

Have a look at Behemoth [https://github.com/DigitalPebble/behemoth] if you
haven't already done so. Porting the code from its GATE module into an
IndexingFilter should not be too difficult. In that module, the GATE
pipeline is stored on HDFS and loaded by the slaves via the distributed
cache.
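
As a rough sketch of the slave-side lookup (this is not the Behemoth code,
just an illustration in plain Java): assume the pipeline was zipped, put on
HDFS and registered with something like Hadoop's
DistributedCache.addCacheArchive(uri, conf), and that the task gets the
unpacked local paths back from DistributedCache.getLocalCacheArchives(conf).
The class name, the archive name "gate-app.zip" and the convention that the
application file ends in .gapp or .xgapp are my assumptions here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper: locates a GATE application inside an archive that
// the distributed cache has already unpacked on the local disk.
public class CachedGateResources {

    // Picks the local cache entry whose last path element matches the
    // archive name we shipped (assumed here, e.g. "gate-app.zip").
    static Optional<Path> findCachedArchive(List<Path> localCachePaths,
                                            String archiveName) {
        return localCachePaths.stream()
                .filter(p -> p.getFileName() != null
                        && p.getFileName().toString().equals(archiveName))
                .findFirst();
    }

    // Looks at the top level of the unpacked archive for a file ending in
    // .gapp or .xgapp (an assumed naming convention for the pipeline file).
    static Optional<Path> findApplicationFile(Path unpackedDir)
            throws IOException {
        try (Stream<Path> children = Files.list(unpackedDir)) {
            return children
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString();
                        return name.endsWith(".gapp")
                                || name.endsWith(".xgapp");
                    })
                    .findFirst();
        }
    }
}
```

In an IndexingFilter you would do this lookup once, when the filter is
configured, and then hand the resulting file to GATE (for instance via
PersistenceManager.loadObjectFromFile) rather than reloading it per document.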

Alternatively, you could use Nutch just for crawling, then process the
output with the Nutch and GATE modules of Behemoth and, if you need
indexing, its SOLR or ElasticSearch modules.

HTH

Julien


On 4 May 2014 06:52, chethan <[email protected]> wrote:

> I have set up Nutch to crawl on Amazon EMR and I have a plugin that
> uses GATE <https://gate.ac.uk/> for text processing in the indexing
> filters. GATE requires certain static resources (some XML and text
> files) to be loaded before it can be initialized. I tried to bundle
> these resources in the job jar and load them from the classpath, but
> that didn't work. I also tried copying them to HDFS and loading them
> from there, but that too failed.
>
> What is the best way to bundle such static resources and reference them in
> the indexing filters? I am working on copying the files to the distributed
> cache and loading them from there, but I wanted to know how others are
> handling this. Thanks.
>
> Regards,
>
> --
> Chethan Prasad
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
