I have set up Nutch to crawl on Amazon EMR, and I have a plugin that uses GATE <https://gate.ac.uk/> for text processing in the indexing filters. GATE requires certain static resources (some XML and text files) to be loaded before it can be initialized. I tried bundling these resources in the job jar and loading them from the classpath, but that didn't work. I also tried copying them to HDFS and loading them from there, but that failed too.
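To illustrate the classpath approach, here is a minimal sketch that copies a bundled resource out of the jar into a local directory, assuming the consuming library needs plain files on local disk rather than streams inside a jar (this is an assumption, not a confirmed cause of the failure). `StaticResourceLoader`, `copyToDir`, and `extractFromClasspath` are hypothetical names and are not part of Nutch, Hadoop, or GATE; only JDK classes are used.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of Nutch, Hadoop, or GATE. */
final class StaticResourceLoader {

    private StaticResourceLoader() {}

    /** Copies the stream to targetDir/fileName, creating targetDir if needed. */
    static Path copyToDir(InputStream in, String fileName, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName);
        // Overwrite any stale copy left by an earlier task attempt.
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    /** Extracts a classpath resource (e.g. one bundled in the job jar) to a local file. */
    static Path extractFromClasspath(ClassLoader loader, String resourceName, Path targetDir)
            throws IOException {
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                // Fail loudly so a missing resource is not mistaken for an init error later.
                throw new FileNotFoundException("Resource not on classpath: " + resourceName);
            }
            // Classpath resource names always use '/' as the separator.
            String fileName = resourceName.substring(resourceName.lastIndexOf('/') + 1);
            return copyToDir(in, fileName, targetDir);
        }
    }
}
```

An indexing filter could call `extractFromClasspath` once during setup with the plugin's class loader and then hand the returned local path to the library. The throwing branch also makes it easy to tell "resource not in the jar" apart from "resource found but rejected by the library".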
What is the best way to bundle such static resources and reference them in the indexing filters? I am now working on copying the files to the distributed cache and loading them from there, but I wanted to know how others are handling this.

Thanks.

Regards,
--
Chethan Prasad

