RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!
Done - NUTCH-2395
https://issues.apache.org/jira/browse/NUTCH-2395

Regards,
Vyacheslav Pascarel

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Saturday, June 24, 2017 2:27 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: ERROR: Cannot run job worker!

Hi Vyacheslav,

Thanks for the update. Can you please open a ticket at
https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_projects_NUTCH=DwIBaQ=ZgVRmm3mf2P1-XDAyDsu4A=XeO6ShRDVKU6HktuQu5d6DHtkdlyuxMSWDVUj-ZGQKE=Ti7iePIyYmd-ZZLJikFB-XeUZ91T7llSIXn3mcnxQ0M=5h5L8GfDpA0DjwfnOwcxaZU2WGD4nRU74FhRnbC7hnM=

If you are able to submit a pull request at
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_nutch_=DwIBaQ=ZgVRmm3mf2P1-XDAyDsu4A=XeO6ShRDVKU6HktuQu5d6DHtkdlyuxMSWDVUj-ZGQKE=Ti7iePIyYmd-ZZLJikFB-XeUZ91T7llSIXn3mcnxQ0M=9Sw9oUodC8CQBD2WhtzdrZ2Ey098yYpAbLjWwAX6zGw=
it would be appreciated.

Lewis

On Sat, Jun 24, 2017 at 9:36 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 23 Jun 2017 13:07:39 +
> Subject: RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!
>
> Hi Lewis,
>
> I think I narrowed the problem to the SelectorEntryComparator class
> nested in GeneratorJob. In the debugger, during a crash, I noticed that
> there is a single instance of SelectorEntryComparator shared across
> multiple reducer tasks. The class inherits from
> org.apache.hadoop.io.WritableComparator, which has a few members that
> are not protected against concurrent use. At some point multiple
> threads may access those members in the WritableComparator.compare
> call. I modified SelectorEntryComparator and that seems to have solved
> the problem, but I am not sure whether the change is appropriate and/or
> sufficient (does it cover GENERATE only?).
>
> Original code:
>
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
> }
>
> Modified code:
>
> public static class SelectorEntryComparator extends WritableComparator {
>   public SelectorEntryComparator() {
>     super(SelectorEntry.class, true);
>   }
>
>   @Override
>   synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
>     return super.compare(b1, s1, l1, b2, s2, l2);
>   }
> }
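
The race described in the quoted message can be illustrated without Hadoop. The following is only a minimal sketch under stated assumptions: BufferingComparator is a hypothetical stand-in for a comparator that keeps scratch state in instance fields (it is not the Hadoop WritableComparator API), SynchronizedComparator mirrors the synchronized override proposed in the thread, and the per-thread variant using ThreadLocal is an alternative swapped in here for comparison, not something the thread says was adopted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ComparatorSharingSketch {

  /** Minimal comparison contract used by the harness below (hypothetical). */
  interface ByteCompare {
    int compare(byte[] a, byte[] b);
  }

  /** Hypothetical stand-in: keeps scratch state in fields, so sharing one instance is racy. */
  static class BufferingComparator {
    private byte[] left;   // shared scratch state, unsafe if two threads call compare()
    private byte[] right;

    int compare(byte[] b1, byte[] b2) {
      left = b1;
      right = b2;
      return new String(left, StandardCharsets.UTF_8)
          .compareTo(new String(right, StandardCharsets.UTF_8));
    }
  }

  /** Same idea as the change in the thread: serialize access to the shared fields. */
  static class SynchronizedComparator extends BufferingComparator {
    @Override
    synchronized int compare(byte[] b1, byte[] b2) {
      return super.compare(b1, b2);
    }
  }

  /** Alternative assumed for illustration: one comparator instance per thread. */
  static final ThreadLocal<BufferingComparator> PER_THREAD =
      ThreadLocal.withInitial(BufferingComparator::new);

  /** Runs concurrent comparisons and counts results that disagree with String.compareTo. */
  static int countMismatches(ByteCompare cmp, int threads, int iterations) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        String a = "url-" + t;
        String b = "url-" + (threads - t);
        int expected = Integer.signum(a.compareTo(b));
        byte[] ab = a.getBytes(StandardCharsets.UTF_8);
        byte[] bb = b.getBytes(StandardCharsets.UTF_8);
        results.add(pool.submit(() -> {
          int bad = 0;
          for (int i = 0; i < iterations; i++) {
            if (Integer.signum(cmp.compare(ab, bb)) != expected) {
              bad++;
            }
          }
          return bad;
        }));
      }
      int total = 0;
      for (Future<Integer> f : results) {
        total += f.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static int syncMismatches() {
    SynchronizedComparator shared = new SynchronizedComparator();
    return countMismatches(shared::compare, 4, 2000);
  }

  public static int perThreadMismatches() {
    return countMismatches((a, b) -> PER_THREAD.get().compare(a, b), 4, 2000);
  }

  public static void main(String[] args) {
    System.out.println("synchronized mismatches: " + syncMismatches());
    System.out.println("per-thread mismatches:   " + perThreadMismatches());
  }
}
```

Passing a single unsynchronized BufferingComparator to countMismatches may or may not report mismatches on a given run, which is typical of this kind of race. The synchronized variant trades throughput for safety, while the per-thread variant avoids lock contention at the cost of one instance per thread.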
RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!
Hi Lewis,

I think I narrowed the problem to the SelectorEntryComparator class nested in GeneratorJob. In the debugger, during a crash, I noticed that there is a single instance of SelectorEntryComparator shared across multiple reducer tasks. The class inherits from org.apache.hadoop.io.WritableComparator, which has a few members that are not protected against concurrent use. At some point multiple threads may access those members in the WritableComparator.compare call. I modified SelectorEntryComparator and that seems to have solved the problem, but I am not sure whether the change is appropriate and/or sufficient (does it cover GENERATE only?).

Original code:

public static class SelectorEntryComparator extends WritableComparator {
  public SelectorEntryComparator() {
    super(SelectorEntry.class, true);
  }
}

Modified code:

public static class SelectorEntryComparator extends WritableComparator {
  public SelectorEntryComparator() {
    super(SelectorEntry.class, true);
  }

  @Override
  synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return super.compare(b1, s1, l1, b2, s2, l2);
  }
}

Regards,
Vyacheslav Pascarel

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Wednesday, June 21, 2017 1:41 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: ERROR: Cannot run job worker!

Hi Vyacheslav,

Which version of Nutch are you using? 2.x?

lewis

On Wed, Jun 21, 2017 at 10:32 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 21 Jun 2017 17:32:15 +
> Subject: ERROR: Cannot run job worker!
>
> Hello,
>
> I am writing an application that performs web site crawling using
> Nutch REST services. The application:
RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
Hi Lewis,

It seems that URLs get mangled when a message is posted to the email list. The seed URL that I used was for MSNBC dot COM: http---www-msnbc-com (replace the dashes with ":", "/", and ".").

Regards,
Vyacheslav Pascarel

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Thursday, June 22, 2017 2:11 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,

Can you provide me an example page with the http refresh tag included? I'll try comparing behaviour between 2.X and master.

Thank you
Lewis

On Sat, Jun 17, 2017 at 9:25 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 16 Jun 2017 13:18:16 +
> Subject: RE: [EXTERNAL] - Re: Outlinks field is not populated when
> page from seed URL when fetched page contains "refresh" meta tag
>
> It is 2.3.1.
RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!
2.3.1

Regards,
Vyacheslav Pascarel

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Wednesday, June 21, 2017 1:41 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: ERROR: Cannot run job worker!

Hi Vyacheslav,

Which version of Nutch are you using? 2.x?

lewis

On Wed, Jun 21, 2017 at 10:32 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 21 Jun 2017 17:32:15 +
> Subject: ERROR: Cannot run job worker!
>
> Hello,
>
> I am writing an application that performs web site crawling using
> Nutch REST services. The application:
ERROR: Cannot run job worker!
fer.flush(MapTask.java:1462)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:267)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
        at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
        at org.apache.hadoop.io.Text.readString(Text.java:464)
        at org.apache.hadoop.io.Text.readString(Text.java:457)
        at org.apache.nutch.crawl.GeneratorJob$SelectorEntry.readFields(GeneratorJob.java:92)
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:158)
        ... 15 more
...
2017-06-21 11:45:13,372 ERROR impl.JobWorker - Cannot run job worker!
java.lang.RuntimeException: job failed: name=[parallel_0]generate: 1498059912-1448058551, jobid=job_local1142434549_0036
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
        at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
        at org.apache.nutch.api.impl.JobWorker.run(JobWorker.java:64)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Regards,
Vyacheslav Pascarel
RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
It is 2.3.1.

Vyacheslav Pascarel

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: Thursday, June 15, 2017 11:23 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,

On Thu, Jun 15, 2017 at 1:41 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +
> Subject: Outlinks field is not populated when page from seed URL when
> fetched page contains "refresh" meta tag
>
> Hello,
>
> I am trying to crawl
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.msnbc.com_=DwIBaQ=ZgVRmm3mf2P1-XDAyDsu4A=XeO6ShRDVKU6HktuQu5d6DHtkdlyuxMSWDVUj-ZGQKE=y4ak_4BuvKZMwom9X3QBIzAMVLnasMYLebPs0Evj-vk=HjlDXsBCmcJ9B2SZh5E05oDyZfRHKu3rrUcCL1hd0JA=
> but am having trouble getting anything besides the original seed URL.
> The INJECT/GENERATE/FETCH steps complete without problems, but after
> executing PARSE I see only one outlink, pointing to the original seed
> URL:
>
> ...

Which version of Nutch are you using?

Lewis
Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag
        } catch (final MalformedURLException e) {
          toHost = null;
        }
        if (toHost == null || !toHost.equals(fromHost)) { // external links
          continue; // skip it
        }
      }
      validCount++;
      page.getOutlinks().put(utf8ToUrl, new Utf8(outlinks[i].getAnchor()));
    }
    Utf8 fetchMark = Mark.FETCH_MARK.checkMark(page);
    if (fetchMark != null) {
      Mark.PARSE_MARK.putMark(page, fetchMark);
    }
  }
}
...

Regards,
Vyacheslav Pascarel