Hi all, I am using Nutch 2.3.1 with HBase-1.2.3 as the storage backend, on top of a Hadoop-2.5.2 cluster in *deploy mode*, with crawled data being indexed to Solr-6.5.1. I want to use the *webapp* for creating, controlling and monitoring crawl jobs in deploy mode.
With the Hadoop cluster, HBase and nutchserver started, the InjectorJob failed when I tried to launch a crawl job through the webapp interface. It was failing because the seed directory was being created on the local filesystem. I fixed it by copying the seed directory to the same path on HDFS, by editing the *createSeedFile* method in *org.apache.nutch.api.resources.SeedResource.java* (the copyDataToHDFS helper is sketched at the end of this mail):

  public String createSeedFile(SeedList seedList) {
    if (seedList == null) {
      throw new WebApplicationException(Response.status(Status.BAD_REQUEST)
          .entity("Seed list cannot be empty!").build());
    }
    File seedFile = createSeedFile();
    BufferedWriter writer = getWriter(seedFile);
    Collection<SeedUrl> seedUrls = seedList.getSeedUrls();
    if (CollectionUtils.isNotEmpty(seedUrls)) {
      for (SeedUrl seedUrl : seedUrls) {
        writeUrl(writer, seedUrl);
      }
    }
    // method to copy seed directory to HDFS: Gajanan
    copyDataToHDFS(seedFile);
    return seedFile.getParent();
  }

With that change I was able to get as far as the index phase, where the job complained that the *solr.server.url* Java property was not set. I set JAVA_TOOL_OPTIONS to include the -Dsolr.server.url property (example at the end of this mail).

*The crawl job is still failing with:*

  18/10/11 10:07:03 ERROR impl.RemoteCommandExecutor: Remote command failed
  java.util.concurrent.TimeoutException
        at java.util.concurrent.FutureTask.get(FutureTask.java:205)
        at org.apache.nutch.webui.client.impl.RemoteCommandExecutor.executeRemoteJob(RemoteCommandExecutor.java:61)
        at org.apache.nutch.webui.client.impl.CrawlingCycle.executeCrawlCycle(CrawlingCycle.java:58)
        at org.apache.nutch.webui.service.impl.CrawlServiceImpl.startCrawl(CrawlServiceImpl.java:69)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
        at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:97)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

I tried increasing the default timeout in *org.apache.nutch.webui.client.impl.RemoteCommandExecutor*:

  private static final int DEFAULT_TIMEOUT_SEC = 300; // can be increased further if required

*Summary:*

In all of this, what I am wondering about is:

1. No webpage table is being created in HBase corresponding to the crawl ID.
2. How, in that case, does the crawl get as far as the index phase?

*Finally, the actual question:*

How do I get my crawl jobs running in deploy mode using the Nutch webapp? What else do I need to do? Am I missing something very basic?
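
P.S. For reference, the copyDataToHDFS helper I added to SeedResource.java is roughly the sketch below. It is my own addition, not stock Nutch; it relies on the standard Hadoop FileSystem API and assumes the default filesystem resolves to HDFS in deploy mode:

  import java.io.File;
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Copies the locally written seed directory to the same path on HDFS,
  // so the InjectorJob running on the cluster can find the seed file.
  private void copyDataToHDFS(File seedFile) {
    try {
      Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
      FileSystem fs = FileSystem.get(conf);            // default FS (HDFS in deploy mode)
      Path localDir = new Path(seedFile.getParent());  // local seed directory
      Path hdfsDir = new Path(seedFile.getParent());   // same path on HDFS
      // don't delete the local source, overwrite the HDFS target if it exists
      fs.copyFromLocalFile(false, true, localDir, hdfsDir);
    } catch (IOException e) {
      throw new RuntimeException("Failed to copy seed directory to HDFS", e);
    }
  }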
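And the solr.server.url property was set via JAVA_TOOL_OPTIONS along these lines (host and collection are placeholders for my actual Solr URL):

  export JAVA_TOOL_OPTIONS="-Dsolr.server.url=http://<solr-host>:8983/solr/<collection>"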