Nutch + GATE on Amazon EMR

chethan Sat, 03 May 2014 22:53:22 -0700

I have set up Nutch to crawl on Amazon EMR, and I have a plugin that
uses GATE <https://gate.ac.uk/> for
text processing in the indexing filters. GATE requires certain static
resources (some XML and text files) to be loaded before it can be initialized.
I tried bundling these resources in the job jar and loading them from the
classpath, but that didn't work. I also tried copying them to HDFS and
loading them from there, but that failed too.
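
To illustrate the classpath approach mentioned above, here is a minimal
sketch, not the actual plugin code. It assumes the library needs real files
on local disk rather than entries inside a jar, so it copies a classpath
resource into a local directory first. The class name ResourceExtractor and
the resource path used in the usage note are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ResourceExtractor {

    /**
     * Copies one classpath resource into targetDir, keeping its relative
     * path, and returns the location of the copy on local disk.
     */
    public static Path extract(ClassLoader loader, String resourceName, Path targetDir)
            throws IOException {
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                // The resource is not visible to this class loader.
                throw new IOException("Resource not found on classpath: " + resourceName);
            }
            Path target = targetDir.resolve(resourceName);
            // Create intermediate directories such as targetDir/gate/.
            Files.createDirectories(target.getParent());
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        }
    }
}
```

A filter could call something like
ResourceExtractor.extract(Thread.currentThread().getContextClassLoader(),
"gate/config.xml", Files.createTempDirectory("gate")) once during setup and
pass the resulting local path on to the library. Whether the job jar's
contents are visible to that class loader on EMR is exactly the open
question here.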


What is the best way to bundle such static resources and reference them in
the indexing filters? I am currently working on copying the files to the
distributed cache and loading them from there, but I wanted to know how
others are handling this. Thanks.

Regards,

--
Chethan Prasad
