RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!

2017-06-26 Thread Vyacheslav Pascarel
Done - NUTCH-2395

https://issues.apache.org/jira/browse/NUTCH-2395

Regards,

Vyacheslav Pascarel


-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Saturday, June 24, 2017 2:27 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: ERROR: Cannot run job worker!

Hi Vyacheslav,
Thanks for the update. Can you please open a ticket at
https://issues.apache.org/jira/projects/NUTCH
If you are able to submit a pull request at
https://github.com/apache/nutch/
it would be appreciated.
Lewis

On Sat, Jun 24, 2017 at 9:36 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 23 Jun 2017 13:07:39 +
> Subject: RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!
> Hi Lewis,
>
> I think I have narrowed the problem down to the SelectorEntryComparator
> class nested in GeneratorJob. In the debugger, during a crash, I noticed
> that there is a single instance of SelectorEntryComparator shared across
> multiple reducer tasks. The class inherits from
> org.apache.hadoop.io.WritableComparator, which has a few members that are
> not protected against concurrent use. At some point multiple threads may
> access those members in the WritableComparator.compare call. I modified
> SelectorEntryComparator and that seems to have solved the problem, but I
> am not sure whether the change is appropriate and/or sufficient (does it
> cover GENERATE only?)
>
> Original code:
> 
>
>   public static class SelectorEntryComparator extends WritableComparator {
>     public SelectorEntryComparator() {
>       super(SelectorEntry.class, true);
>     }
>   }
>
> Modified code:
> 
>   public static class SelectorEntryComparator extends WritableComparator {
>     public SelectorEntryComparator() {
>       super(SelectorEntry.class, true);
>     }
>
>     @Override
>     public synchronized int compare(byte[] b1, int s1, int l1,
>         byte[] b2, int s2, int l2) {
>       return super.compare(b1, s1, l1, b2, s2, l2);
>     }
>   }
>
>


RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!

2017-06-23 Thread Vyacheslav Pascarel
Hi Lewis,

I think I have narrowed the problem down to the SelectorEntryComparator class
nested in GeneratorJob. In the debugger, during a crash, I noticed that there
is a single instance of SelectorEntryComparator shared across multiple reducer
tasks. The class inherits from org.apache.hadoop.io.WritableComparator, which
has a few members that are not protected against concurrent use. At some point
multiple threads may access those members in the WritableComparator.compare
call. I modified SelectorEntryComparator and that seems to have solved the
problem, but I am not sure whether the change is appropriate and/or sufficient
(does it cover GENERATE only?)

Original code:


  public static class SelectorEntryComparator extends WritableComparator {
    public SelectorEntryComparator() {
      super(SelectorEntry.class, true);
    }
  }

Modified code:

  public static class SelectorEntryComparator extends WritableComparator {
    public SelectorEntryComparator() {
      super(SelectorEntry.class, true);
    }

    @Override
    public synchronized int compare(byte[] b1, int s1, int l1,
        byte[] b2, int s2, int l2) {
      return super.compare(b1, s1, l1, b2, s2, l2);
    }
  }
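
To illustrate the kind of race described above, here is a minimal,
self-contained sketch. It does not use Hadoop: SharedStateComparator is a
hypothetical stand-in that keeps scratch state in instance fields, in the
spirit of the WritableComparator members mentioned above, and
SynchronizedComparator applies the same synchronized override as the modified
code. The class names, the countMismatches helper and the test keys are
assumptions made only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ComparatorRaceSketch {

  /** Hypothetical stand-in: decodes both keys into fields reused by every call. */
  static class SharedStateComparator {
    private String key1; // scratch state shared by all callers of this instance
    private String key2;

    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      key1 = new String(b1, s1, l1, StandardCharsets.UTF_8);
      key2 = new String(b2, s2, l2, StandardCharsets.UTF_8);
      // Another thread may have overwritten key1 or key2 by the time we get here.
      return key1.compareTo(key2);
    }
  }

  /** Same idea as the modified code: only one thread at a time may compare. */
  static class SynchronizedComparator extends SharedStateComparator {
    @Override
    public synchronized int compare(byte[] b1, int s1, int l1,
        byte[] b2, int s2, int l2) {
      return super.compare(b1, s1, l1, b2, s2, l2);
    }
  }

  /** Shares one comparator across several threads and counts wrong-signed results. */
  static int countMismatches(final SharedStateComparator cmp, int threads,
      final int iterations) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Integer>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        // Even threads compare (low, high); odd threads compare (high, low).
        final boolean ascending = (t % 2 == 0);
        final byte[] low = ("a" + t).getBytes(StandardCharsets.UTF_8);
        final byte[] high = ("b" + t).getBytes(StandardCharsets.UTF_8);
        Callable<Integer> task = () -> {
          int bad = 0;
          for (int i = 0; i < iterations; i++) {
            int r = ascending
                ? cmp.compare(low, 0, low.length, high, 0, high.length)
                : cmp.compare(high, 0, high.length, low, 0, low.length);
            if (ascending ? r >= 0 : r <= 0) {
              bad++; // the sign contradicts the keys this thread passed in
            }
          }
          return bad;
        };
        results.add(pool.submit(task));
      }
      int total = 0;
      for (Future<Integer> f : results) {
        total += f.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("unsynchronized mismatches: "
        + countMismatches(new SharedStateComparator(), 4, 200000));
    System.out.println("synchronized mismatches:   "
        + countMismatches(new SynchronizedComparator(), 4, 200000));
  }
}
```

With the unsynchronized comparator the mismatch count varies from run to run
and may well be zero on a given run. With the synchronized one it should
always be zero, at the cost of serializing every comparison through one lock.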

Regards,

Vyacheslav Pascarel


-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Wednesday, June 21, 2017 1:41 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: ERROR: Cannot run job worker!

Hi Vyacheslav,

Which version of Nutch are you using? 2.x?
lewis

On Wed, Jun 21, 2017 at 10:32 AM, <user-digest-h...@nutch.apache.org> wrote:

>
>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 21 Jun 2017 17:32:15 +
> Subject: ERROR: Cannot run job worker!
> Hello,
>
> I am writing an application that performs web site crawling using 
> Nutch REST services. The application:
>
>
>


RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-22 Thread Vyacheslav Pascarel
Hi Lewis,

It seems that URLs get mangled when a message is posted to the mailing list.
The seed URL that I used was for MSNBC dot COM:

http---www-msnbc-com  (replace dashes with ":", "/", and ".")

Regards,

Vyacheslav Pascarel


-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Thursday, June 22, 2017 2:11 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed 
URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,
Can you provide me an example page with an HTTP refresh tag included? I'll try
comparing behaviour between 2.X and master.
Thank you
Lewis

On Sat, Jun 17, 2017 at 9:25 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Fri, 16 Jun 2017 13:18:16 +
> Subject: RE: [EXTERNAL] - Re: Outlinks field is not populated when 
> page from seed URL when fetched page contains "refresh" meta tag
> It is 2.3.1.
>
>


RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!

2017-06-21 Thread Vyacheslav Pascarel
2.3.1

Regards,

Vyacheslav Pascarel


-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Wednesday, June 21, 2017 1:41 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: ERROR: Cannot run job worker!

Hi Vyacheslav,

Which version of Nutch are you using? 2.x?
lewis

On Wed, Jun 21, 2017 at 10:32 AM, <user-digest-h...@nutch.apache.org> wrote:

>
>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 21 Jun 2017 17:32:15 +
> Subject: ERROR: Cannot run job worker!
> Hello,
>
> I am writing an application that performs web site crawling using 
> Nutch REST services. The application:
>
>
>


ERROR: Cannot run job worker!

2017-06-21 Thread Vyacheslav Pascarel
fer.flush(MapTask.java:1462)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:267)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
    at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
    at org.apache.hadoop.io.Text.readString(Text.java:464)
    at org.apache.hadoop.io.Text.readString(Text.java:457)
    at org.apache.nutch.crawl.GeneratorJob$SelectorEntry.readFields(GeneratorJob.java:92)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:158)
    ... 15 more

...

2017-06-21 11:45:13,372 ERROR impl.JobWorker - Cannot run job worker!
java.lang.RuntimeException: job failed: name=[parallel_0]generate: 1498059912-1448058551, jobid=job_local1142434549_0036
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:227)
    at org.apache.nutch.api.impl.JobWorker.run(JobWorker.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


Regards,

Vyacheslav Pascarel



RE: [EXTERNAL] - Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-16 Thread Vyacheslav Pascarel
It is 2.3.1. 

Vyacheslav Pascarel


-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Thursday, June 15, 2017 11:23 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: Outlinks field is not populated when page from seed 
URL when fetched page contains "refresh" meta tag

Hi Vyacheslav,

On Thu, Jun 15, 2017 at 1:41 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Vyacheslav Pascarel <vpasc...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +
> Subject: Outlinks field is not populated when page from seed URL when 
> fetched page contains "refresh" meta tag
> Hello,
>
> I am trying to crawl 
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.msnbc.com_=D
> wIBaQ=ZgVRmm3mf2P1-XDAyDsu4A=XeO6ShRDVKU6HktuQu5d6DHtkdlyuxMSWDVUj
> -ZGQKE=y4ak_4BuvKZMwom9X3QBIzAMVLnasMYLebPs0Evj-vk=HjlDXsBCmcJ9B2SZh5E05oDyZfRHKu3rrUcCL1hd0JA=
>   but am having trouble getting anything besides the original seed URL. The
> INJECT/GENERATE/FETCH steps complete without problems, but after executing
> PARSE I see only one outlink, pointing to the original seed URL:
>
> ...
Which version of Nutch are you using?
Lewis


Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-14 Thread Vyacheslav Pascarel
          } catch (final MalformedURLException e) {
            toHost = null;
          }
          if (toHost == null || !toHost.equals(fromHost)) { // external links
            continue; // skip it
          }
        }
        validCount++;
        page.getOutlinks().put(utf8ToUrl, new Utf8(outlinks[i].getAnchor()));
      }
      Utf8 fetchMark = Mark.FETCH_MARK.checkMark(page);
      if (fetchMark != null) {
        Mark.PARSE_MARK.putMark(page, fetchMark);
      }
    }
  }
...
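
For readers who want to try the host check from the fragment above in
isolation, here is a small standalone sketch. It only mirrors the visible
condition (skip a link when its host is missing or differs from the source
host). The class InternalLinkFilterSketch, the helpers hostOf and
keepInternal, and the example.com URLs are hypothetical and are not part of
Nutch.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InternalLinkFilterSketch {

  /** Returns the lower-cased host of a URL, or null if the URL cannot be parsed. */
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  /** Keeps only the links whose host equals the host of the source page. */
  static List<String> keepInternal(String fromUrl, List<String> outlinks) {
    String fromHost = hostOf(fromUrl);
    List<String> kept = new ArrayList<>();
    for (String link : outlinks) {
      String toHost = hostOf(link);
      if (toHost == null || !toHost.equals(fromHost)) { // external or unparsable
        continue; // skip it, as in the fragment above
      }
      kept.add(link);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "http://www.example.com/news", // same host: kept
        "http://example.com/",         // host differs by the "www." prefix: skipped
        "http://other.example.org/",   // different host: skipped
        "not a url");                  // unparsable: skipped
    System.out.println(keepInternal("http://www.example.com/", links));
  }
}
```

Note that the comparison is an exact string match on the host, so in this
sketch a link to the same site without the "www." prefix counts as external.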

Regards,

Vyacheslav Pascarel