Nutch 1.13 parsing links but ignoring them?

2017-06-26 Thread Yossi Tamari
I'm seeing many cases where ParserChecker finds outlinks in a document, but
when I crawl that same document, those outlinks do not appear in the crawl
DB at all (and are not indexed).

My URL filters are trivial as far as I can tell, and the missing links are
not special in any way that I can see.

For example:

bin/nutch parsechecker -dumpText "http://corporate.exxonmobil.com/"

finds, among others, the URLs https://energyfactor.exxonmobil.com/ and
http://corporate.exxonmobil.com/en/investors/corporate-governance.

However, when running

bin/crawl urls_yossi yossi 2

with only http://corporate.exxonmobil.com/ in urls_yossi, and then dumping
yossi/crawldb (using `nutch readdb`), the two URLs above are not found.

When the crawl finishes, the crawldb contains 786 entries, far below topN.

Any idea what could be causing these URLs to be ignored?
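One way to narrow this down (a diagnostic sketch, not from the original
thread; the checker's flags vary a little across 1.x releases) is to run the
missing URLs through the configured filter chain:

echo "https://energyfactor.exxonmobil.com/" | bin/nutch filterchecker -allCombined
echo "http://corporate.exxonmobil.com/en/investors/corporate-governance" | bin/nutch filterchecker -allCombined

A leading '+' in the output means the URL passed every filter; a '-' means
one of them rejected it. If both URLs pass, db.ignore.external.links in
nutch-site.xml is also worth checking, since https://energyfactor.exxonmobil.com/
is on a different host than the seed URL.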



RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!

2017-06-26 Thread Vyacheslav Pascarel
Done - NUTCH-2395

https://issues.apache.org/jira/browse/NUTCH-2395

Regards,

Vyacheslav Pascarel


-Original Message-
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: Saturday, June 24, 2017 2:27 PM
To: user@nutch.apache.org
Subject: [EXTERNAL] - Re: ERROR: Cannot run job worker!

Hi Vyacheslav,
Thanks for the update, can you please open a ticket at 
https://issues.apache.org/jira/projects/NUTCH
If you are able to submit a pull request at
https://github.com/apache/nutch/, it would be appreciated.
Lewis

On Sat, Jun 24, 2017 at 9:36 AM, Vyacheslav Pascarel wrote:

>
> From: Vyacheslav Pascarel 
> To: "user@nutch.apache.org" 
> Date: Fri, 23 Jun 2017 13:07:39 +
> Subject: RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!
> Hi Lewis,
>
> I think I have narrowed the problem down to the SelectorEntryComparator 
> class nested in GeneratorJob. In the debugger, during the crash, I noticed 
> a single instance of SelectorEntryComparator shared across multiple 
> reducer tasks. The class inherits from 
> org.apache.hadoop.io.WritableComparator, which has a few members that are 
> not protected against concurrent use. At some point multiple threads may 
> access those members in a WritableComparator.compare call. I modified 
> SelectorEntryComparator, and that seems to have solved the problem, but I 
> am not sure whether the change is appropriate and/or sufficient (does it 
> cover GENERATE only?)
>
> Original code:
> 
>
>   public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
>   }
>
> Modified code:
> 
>   public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
>
>     // WritableComparator's raw compare() deserializes into buffers held
>     // as instance fields, so a shared instance is not safe to call from
>     // several reducer threads at once; synchronizing serializes access.
>     @Override
>     synchronized public int compare(byte[] b1, int s1, int l1,
>                                     byte[] b2, int s2, int l2) {
>       return super.compare(b1, s1, l1, b2, s2, l2);
>     }
>   }
>
>
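For context, WritableComparator's raw compare() deserializes both keys into
buffers held as instance fields, which is exactly the state that is unsafe
to share; synchronizing compare() fixes the race by serializing every
comparison through the single shared instance. A hypothetical alternative
(a sketch only, not from the thread) is to implement RawComparator directly
with per-call buffers, trading the lock for extra allocation:

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.RawComparator;

// Sketch: nothing is shared between threads because every call builds its
// own buffers and keys. Assumes SelectorEntry keeps the no-arg constructor
// and readFields()/compareTo() that WritableComparator already relies on.
public static class SelectorEntryComparator implements RawComparator<SelectorEntry> {

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      DataInputBuffer in1 = new DataInputBuffer();
      DataInputBuffer in2 = new DataInputBuffer();
      SelectorEntry key1 = new SelectorEntry();
      SelectorEntry key2 = new SelectorEntry();
      in1.reset(b1, s1, l1);
      key1.readFields(in1);
      in2.reset(b2, s2, l2);
      key2.readFields(in2);
      return key1.compareTo(key2);
    } catch (IOException e) {
      throw new RuntimeException("Failed to deserialize SelectorEntry", e);
    }
  }

  @Override
  public int compare(SelectorEntry a, SelectorEntry b) {
    return a.compareTo(b);
  }
}

The per-call allocation adds GC pressure on a hot sort path (which is why
Hadoop's own comparators reuse buffers), so the synchronized override above
is the simpler patch; the sketch just shows where the shared state lives.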