Chethan,

Have a look at Behemoth [https://github.com/DigitalPebble/behemoth] if you haven't already done so. Porting the code from its GATE module into an IndexingFilter should not be too difficult. What we do there is store the GATE pipeline on HDFS and have the slaves load it via the distributed cache.

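To make the distributed cache idea a bit more concrete, here is a minimal sketch of the lookup step on the slave side. It is not the Behemoth code. It assumes the task has already obtained the local paths of the cached files (with the Hadoop API of that era this would typically be DistributedCache.getLocalCacheFiles(conf) or getLocalCacheArchives(conf)) and has converted them to strings. The class name GateResourceLocator and the resource name "gate-app.zip" are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/**
 * Hypothetical helper (not part of Nutch, GATE or Behemoth): picks the GATE
 * application out of the local paths that the distributed cache exposes to a task.
 */
final class GateResourceLocator {

    private GateResourceLocator() {}

    /**
     * @param localCachePaths local paths of the cached files or archives, as
     *        strings (assumption: the caller got them from the distributed cache)
     * @param resourceName    file or directory name to look for, e.g. "gate-app.zip"
     * @return the first path whose last element equals resourceName, if any
     */
    static Optional<Path> locate(String[] localCachePaths, String resourceName) {
        if (localCachePaths == null) {
            // Nothing was cached for this job, so there is nothing to find.
            return Optional.empty();
        }
        for (String candidate : localCachePaths) {
            Path path = Paths.get(candidate);
            Path fileName = path.getFileName();
            if (fileName != null && fileName.toString().equals(resourceName)) {
                return Optional.of(path);
            }
        }
        return Optional.empty();
    }
}
```

An IndexingFilter could call locate(...) once when it is configured, initialise GATE from the returned path, and keep the loaded pipeline in a field so that it is not reloaded for every document.
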
Alternatively, you could use Nutch just for crawling, then use the Nutch and GATE modules of Behemoth, as well as the SOLR or ElasticSearch ones, if that's what you want to do.

HTH

Julien

On 4 May 2014 06:52, chethan <[email protected]> wrote:

> I have setup Nutch to crawl on Amazon EMR and I have a plugin that
> uses GATE<https://gate.ac.uk/> for
> text processing in the Indexing filters. GATE requires certain static
> resources (some xmls and text files) to be loaded for it to be initialized.
> I tried to bundle these resources in the job jar and load them from the
> classpath but that didn't work. I also tried copying them to HDFS and
> loading them from there but that too failed.
>
> What is the best way to bundle such static resources and reference them in
> the Indexing filters? I am working on copying the file to the distributed
> cache and loading it from there but I wanted to know how others are
> handling this. Thanks.
>
> Regards,
>
> --
> Chethan Prasad
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

