Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-16 Thread Semyon Semyonov
Hi again, another issue has appeared with the introduction of the bidirectional URL exemption filter. Given http://www.website.com/page1 and http://website.com/page2: before, as indexer output (let's say a text file), I had one parent/host (www.website.com) with

Re: Fetcher error when running on Amazon EMR with S3

2018-03-16 Thread Sebastian Nagel
Hi John, the recent master has seen an upgrade to the new MapReduce API (NUTCH-2375). It was a huge change that is already known to have introduced some issues. For production, it's recommended to use 1.14 and patch it if necessary. Could you open a new issue on

Fetcher error when running on Amazon EMR with S3

2018-03-16 Thread John Thornton
Hello, I'm currently running Nutch under Amazon EMR 5.12.0 with Hadoop 2.8.3, using S3 (EMRFS) as the filesystem. If I build the latest version from the master branch and run a crawl in distributed mode, I get a fetcher error like fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong