Hi Gajanan,
Response inline

On 2018/10/12 07:40:50, Gajanan Watkar <gajananwat...@gmail.com> wrote: 
> Hi all,
> I am using Nutch 2.3.1 with HBase 1.2.3 as the storage backend on top of a
> Hadoop 2.5.2 cluster in *deploy mode*, with crawled data being indexed to
> Solr 6.5.1.
> I want to use the *webapp* for creating, controlling and monitoring crawl jobs
> in deploy mode.
> 
> With the Hadoop cluster, HBase and the nutchserver started, when I tried to launch
> a crawl job through the webapp interface, the InjectorJob failed.
> This was happening because the seed directory was being created on the local filesystem.
> I fixed it by moving it to the same path on HDFS, by editing the *createSeedFile*
> method in *org.apache.nutch.api.resources.SeedResource.java*:
> 
> public String createSeedFile(SeedList seedList) {
>     if (seedList == null) {
>       throw new WebApplicationException(Response.status(Status.BAD_REQUEST)
>           .entity("Seed list cannot be empty!").build());
>     }
>     File seedFile = createSeedFile();
>     BufferedWriter writer = getWriter(seedFile);
> 
>     Collection<SeedUrl> seedUrls = seedList.getSeedUrls();
>     if (CollectionUtils.isNotEmpty(seedUrls)) {
>       for (SeedUrl seedUrl : seedUrls) {
>         writeUrl(writer, seedUrl);
>       }
>     }
> 
> 
>     // method to copy seed directory to HDFS: Gajanan
>     copyDataToHDFS(seedFile);
> 
>     return seedFile.getParent();
>   }

I was aware of this some time ago and never found the time to fix it. I just 
checked JIRA as well and there is no ticket addressing the task, although I am 
certain it has been discussed on this mailing list previously.
Anyway, can you please create an issue in JIRA, marking it as affecting 2.x, 
tag it with both "REST_api" and "web gui", and submit this as a pull request? It 
would be a huge help.
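
For reference, below is a rough sketch of what that copyDataToHDFS(File) helper 
could look like. This is only an assumption on my side (the method name comes 
from your snippet and is not existing Nutch code), using the standard Hadoop 
FileSystem API and copying the seed directory to the same path on HDFS:

  // Hypothetical helper, not existing Nutch code: copies the locally written
  // seed directory to the same path on HDFS so the InjectorJob can find it.
  // Requires org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem
  // and org.apache.hadoop.fs.Path imports.
  private void copyDataToHDFS(File seedFile) {
    try {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path seedDir = new Path(seedFile.getParent());
      // copyFromLocalFile(delSrc, overwrite, src, dst): keep the local copy,
      // overwrite anything already at the destination path on HDFS
      fs.copyFromLocalFile(false, true, seedDir, seedDir);
    } catch (IOException e) {
      throw new RuntimeException("Failed to copy seed directory to HDFS", e);
    }
  }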
> 
> Then I was able to get up to the index phase, where it complained that the
> *solr.server.url* Java property had not been set.
> *I set JAVA_TOOL_OPTIONS to include the -Dsolr.server.url property.*
> 
> *The crawl job is still failing with:*
> 18/10/11 10:07:03 ERROR impl.RemoteCommandExecutor: Remote command failed
> java.util.concurrent.TimeoutException
>     at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>     at org.apache.nutch.webui.client.impl.RemoteCommandExecutor.executeRemoteJob(RemoteCommandExecutor.java:61)
>     at org.apache.nutch.webui.client.impl.CrawlingCycle.executeCrawlCycle(CrawlingCycle.java:58)
>     at org.apache.nutch.webui.service.impl.CrawlServiceImpl.startCrawl(CrawlServiceImpl.java:69)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
>     at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
>     at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
>     at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:97)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 
> I tried to change the default timeout in
> *org.apache.nutch.webui.client.impl.RemoteCommandExecutor*:
> 
> private static final int DEFAULT_TIMEOUT_SEC = 300; // Can be increased if required

There are also various issues about this in JIRA. Can you please check them out 
and let me know if you can find the correct one? Maybe the following? 
https://issues.apache.org/jira/browse/NUTCH-2313
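
In the meantime, a possible workaround (purely a sketch on my part; this is not 
how RemoteCommandExecutor currently reads its timeout, and the property name is 
made up) would be to pull the value from a system property instead of 
hard-coding it:

  // Hypothetical: read the timeout from a JVM system property so it can be
  // tuned via JAVA_TOOL_OPTIONS without recompiling. The property name
  // "nutch.webui.remote.timeout.sec" does not exist in Nutch; it is only an example.
  private static final int DEFAULT_TIMEOUT_SEC =
      Integer.getInteger("nutch.webui.remote.timeout.sec", 300);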
> 
> *Summary:*
> *But in all this, what I am wondering about is:*
> *1. No webpage table is being created in HBase corresponding to the crawl ID.*

Again, please check JIRA for this information; there may already be something 
logged which indicates what is wrong.

> *2. How, in that case, does it get up to the index phase of the crawl?*

It shouldn't!

> 
> *Finally actual question:*
> 
> *How do I get my crawl jobs running in deploy mode using the Nutch webapp?
> What else do I need to do? Am I missing something very basic?*

As far as I can remember, this functionality has not been baked in... or if it 
has, it is only in 2.x from Git. Please check out the code from Git and try it 
there... your results may differ.

Lewis
