Also, I'm not able to see any logs generated by the plugin or the Nutch base
classes. There are lots of Hadoop logs, but none from Nutch. Any idea what
the cause could be?

Regards,

--
Chethan Prasad


On Mon, May 5, 2014 at 12:14 PM, chethan <[email protected]> wrote:

> Thanks Feng and Julien for your replies. I will take a look at both
> options and update what worked.
>
> Regards,
>
> --
> Chethan Prasad
>
>
> On Mon, May 5, 2014 at 12:10 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Chethan
>>
>> Have a look at Behemoth [https://github.com/DigitalPebble/behemoth] if you
>> haven't already done so. Porting the code from the GATE module into an
>> IndexingFilter should not be too difficult. What we do there is that the
>> GATE pipeline is stored on HDFS and loaded by the slaves via the
>> distributed cache.
>>
>> Alternatively, you could use Nutch just for crawling, then use the Nutch
>> and GATE modules of Behemoth, as well as the SOLR or ElasticSearch ones, if
>> that's what you want to do.
>>
>> HTH
>>
>> Julien
>>
>>
>> On 4 May 2014 06:52, chethan <[email protected]> wrote:
>>
>> > I have set up Nutch to crawl on Amazon EMR and I have a plugin that
>> > uses GATE <https://gate.ac.uk/> for
>> > text processing in the indexing filters. GATE requires certain static
>> > resources (some XML and text files) to be loaded for it to be initialized.
>> > I tried to bundle these resources in the job jar and load them from the
>> > classpath, but that didn't work. I also tried copying them to HDFS and
>> > loading them from there, but that failed too.
>> >
>> > What is the best way to bundle such static resources and reference them in
>> > the indexing filters? I am working on copying the file to the distributed
>> > cache and loading it from there, but I wanted to know how others are
>> > handling this. Thanks.
>> >
>> > Regards,
>> >
>> > --
>> > Chethan Prasad
>> >
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
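
The distributed-cache approach discussed in the thread above can be sketched
roughly as follows. This is only an illustration under assumptions, not the
Behemoth code: it assumes the GATE resources were shipped as an archive
through Hadoop's distributed cache and that the unpacked archive is symlinked
into the task's working directory under the name "gate-resources". The class
GateResourceLocator, the link name, and the file names are all hypothetical.
The sketch uses only the Java standard library, so it shows the task-side
lookup and not the job-side call that registers the archive.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; not part of Nutch, GATE, or Behemoth. */
final class GateResourceLocator {
    // Resolved once per task JVM so every filter call reuses the same path.
    private static volatile Path cachedHome;

    private GateResourceLocator() {}

    /**
     * Checks that workingDir/linkName is a directory containing each
     * required file, and returns that directory. Fails early with a
     * descriptive message so a missing cache entry is easy to spot.
     */
    static Path resolve(Path workingDir, String linkName, String... requiredFiles)
            throws IOException {
        Path home = workingDir.resolve(linkName);
        if (!Files.isDirectory(home)) {
            throw new IOException("Resource directory not found: " + home.toAbsolutePath());
        }
        for (String name : requiredFiles) {
            Path file = home.resolve(name);
            if (!Files.isReadable(file)) {
                throw new IOException("Missing or unreadable resource: " + file);
            }
        }
        return home;
    }

    /**
     * Lazily resolves the resource directory relative to the current
     * working directory (assumed to be where the cache symlink lives).
     */
    static Path gateHome(String linkName, String... requiredFiles) throws IOException {
        Path home = cachedHome;
        if (home == null) {
            synchronized (GateResourceLocator.class) {
                if (cachedHome == null) {
                    cachedHome = resolve(Paths.get("").toAbsolutePath(), linkName, requiredFiles);
                }
                home = cachedHome;
            }
        }
        return home;
    }
}
```

An indexing filter could call GateResourceLocator.gateHome("gate-resources",
"gate.xml") once during setup and hand the resulting directory to GATE's
initialization. Failing with the full path in the exception message makes it
easier to tell a missing cache entry apart from a classpath problem.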
