Thanks Feng and Julien for your replies. I will take a look at both options and report back on what worked.

Regards,

--
Chethan Prasad

On Mon, May 5, 2014 at 12:10 AM, Julien Nioche <[email protected]> wrote:

> Chethan
>
> Have a look at Behemoth [https://github.com/DigitalPebble/behemoth] if you
> haven't already done so. Porting the code from the GATE module into an
> IndexingFilter should not be too difficult. What we do there is that the
> GATE pipeline is stored on HDFS and loaded by the slaves via the
> distributed cache.
>
> Alternatively you could use the Nutch just for crawling then use the Nutch
> and GATE modules of Behemoth as well as the SOLR or ElasticSearch ones if
> that's what you want to do.
>
> HTH
>
> Julien
>
>
> On 4 May 2014 06:52, chethan <[email protected]> wrote:
>
> > I have setup Nutch to crawl on Amazon EMR and I have a plugin that
> > uses GATE <https://gate.ac.uk/> for text processing in the Indexing
> > filters. GATE requires certain static resources (some xmls and text
> > files) to be loaded for it to be initialized. I tried to bundle these
> > resources in the job jar and load them from the classpath but that
> > didn't work. I also tried copying them to HDFS and loading them from
> > there but that too failed.
> >
> > What is the best way to bundle such static resources and reference them
> > in the Indexing filters? I am working on copying the file to the
> > distributed cache and loading it from there but I wanted to know how
> > others are handling this. Thanks.
> >
> > Regards,
> >
> > --
> > Chethan Prasad
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

