Hi all,
I am using Nutch 2.3.1 with HBase-1.2.3 as the storage backend on top of a
Hadoop-2.5.2 cluster in *deploy mode*, with crawled data being indexed to
Solr-6.5.1.
I want to use the *webapp* for creating, controlling and monitoring crawl
jobs in deploy mode.

With the Hadoop cluster, HBase and the Nutch server started, the InjectorJob
failed when I tried to launch a crawl job through the webapp interface.
This was happening because the seed directory was being created on the local
filesystem. I fixed it by copying it to the same path on HDFS, by editing the
*createSeedFile* method in *org.apache.nutch.api.resources.SeedResource.java*
(my change, and a sketch of the helper it calls, is shown below).

public String createSeedFile(SeedList seedList) {
    if (seedList == null) {
      throw new WebApplicationException(Response.status(Status.BAD_REQUEST)
          .entity("Seed list cannot be empty!").build());
    }
    File seedFile = createSeedFile();
    BufferedWriter writer = getWriter(seedFile);

    Collection<SeedUrl> seedUrls = seedList.getSeedUrls();
    if (CollectionUtils.isNotEmpty(seedUrls)) {
      for (SeedUrl seedUrl : seedUrls) {
        writeUrl(writer, seedUrl);
      }
    }


    // copy the seed directory to HDFS (helper sketched below): Gajanan
    copyDataToHDFS(seedFile);

    return seedFile.getParent();
  }
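
For anyone who wants to do the same, a minimal sketch of such a helper using
the Hadoop FileSystem API could look like this (configuration loading and
error handling are simplified, so adjust for your setup):

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

  // Copies the locally written seed file to the same path on HDFS so that
  // the InjectorJob, running on the cluster in deploy mode, can read it.
  private void copyDataToHDFS(File seedFile) {
    try {
      Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
      FileSystem fs = FileSystem.get(conf);
      Path dst = new Path(seedFile.getParent()); // same directory path, but on HDFS
      fs.mkdirs(dst);
      // keep the local copy (delSrc=false), overwrite any stale file on HDFS
      fs.copyFromLocalFile(false, true, new Path(seedFile.getAbsolutePath()), dst);
    } catch (IOException e) {
      throw new IllegalStateException("Could not copy seed directory to HDFS", e);
    }
  }

Copying (rather than moving) leaves the local file in place, which is handy
for debugging.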

Then I was able to get up to the index phase, where it complained about the
*solr.server.url* Java property not being set.
*I set JAVA_TOOL_OPTIONS to include the -Dsolr.server.url property.*

*The crawl job is still failing with:*
18/10/11 10:07:03 ERROR impl.RemoteCommandExecutor: Remote command failed
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.nutch.webui.client.impl.RemoteCommandExecutor.executeRemoteJob(RemoteCommandExecutor.java:61)
    at org.apache.nutch.webui.client.impl.CrawlingCycle.executeCrawlCycle(CrawlingCycle.java:58)
    at org.apache.nutch.webui.service.impl.CrawlServiceImpl.startCrawl(CrawlServiceImpl.java:69)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
    at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:97)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I tried increasing the default timeout in
*org.apache.nutch.webui.client.impl.RemoteCommandExecutor*:

private static final int DEFAULT_TIMEOUT_SEC = 300;  // can be increased further if required
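
That constant is presumably what bounds how long the webapp-side executor
waits on the Future for the remote job; the TimeoutException in the stack
trace above is thrown out of that timed Future.get call inside
executeRemoteJob. A tiny stand-alone illustration of the failure mode (not
Nutch code), just to show why raising the value helps when jobs are slow:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Stand-alone demo: Future.get(timeout) gives up with TimeoutException while
// the submitted work is still running -- the same failure mode as the
// webapp waiting on a long-running remote crawl job.
public class TimeoutDemo {
  // kept short here so the demo finishes quickly; in RemoteCommandExecutor
  // I raised the corresponding DEFAULT_TIMEOUT_SEC to 300
  private static final int TIMEOUT_SEC = 2;

  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<String> job = executor.submit(() -> {
      TimeUnit.SECONDS.sleep(TIMEOUT_SEC + 3); // the "remote job" outlives the timeout
      return "FINISHED";
    });
    try {
      System.out.println(job.get(TIMEOUT_SEC, TimeUnit.SECONDS));
    } catch (TimeoutException e) {
      System.err.println("Remote command failed (simulated): " + e);
    } finally {
      executor.shutdownNow(); // the waiting side gives up; the job itself is just abandoned
    }
  }
}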

*Summary:*
*What I am wondering about in all of this is:*
*1. No webpage table corresponding to the crawl ID is being created in HBase.*
*2. How, in that case, does the crawl get as far as the index phase?*

*Finally, the actual question:*

*How do I get my crawl jobs running in deploy mode using the Nutch webapp?
What else do I need to do? Am I missing something very basic?*
