Re: Nutch 1.11 SSLHandshakeException

2018-03-20 Thread Sebastian Nagel
Hi Robert, unfortunately, I'm not able to reproduce the problem. Fetching works with the recent 1.x and Java 8, I've tried both: bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' https://potomac.edu/ bin/nutch parsechecker

Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Sebastian Nagel
Hi, > more control over what is being indexed? It's possible to enable URL filters for the indexer: bin/nutch index ... -filter With little extra effort you can use different URL filter rules during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR to a different folder. >> I
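
To illustrate how a separate rule set for the index step could differ from the one used while crawling, here is a minimal Python sketch of "+/-" prefixed regex rules in the style of Nutch's regex-urlfilter.txt. It assumes that the first matching rule decides and that a URL matching no rule is rejected. The rule patterns and the names INDEX_RULES, load_rules and accepts are hypothetical, and Python's regex dialect is not identical to Java's, so treat this as a way to reason about rules rather than as Nutch's implementation.

```python
import re

# Hypothetical index-time rules: keep hub-like listing pages out of the
# index while still accepting everything else. Patterns are illustrative.
INDEX_RULES = """
# reject listing/hub style URLs
-/category/
-/tag/
# accept anything else
+.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches; reject if none does."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False
```

With such a rule file in a second configuration folder, the crawl can keep following hub pages for link discovery while the index step drops them.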

Re: Is there any way to block the hubpages while crawling

2018-03-20 Thread Michael Coffey
I think you will find that you need different rules for each website and that some amount of maintenance will be needed as the websites change their practices.

Re: Internal links appear to be external in Parse. Improvement of the crawling quality

2018-03-20 Thread Semyon Semyonov
I found out that there is no direct way to do it. The problem was solved by calling the regex transformation one more time in IndexerMapReduce, before the Indexer gets the Doc for writing. Something like (IndexerMapReduce.java, line 369): doc.add("modifiedId",

RE: Is there any way to block the hubpages while crawling

2018-03-20 Thread Markus Jelsma
Hello Shiva, Yes, that is possible, but ours is not a foolproof solution. We got our first hub classifier years ago in the form of a simple ParseFilter backed by an SVM. The model was built solely on the HTML of positive and negative examples, with very few features, so it was extremely

Re: Nutch 1.11 SSLHandshakeException

2018-03-20 Thread Robert Scavilla
Again I thank you Sebastian! I was able to resolve the issue by updating the HTTPClient library. I also updated from Nutch 1.11 to 1.14 and had no issue with the SSL. Best, ...bob On Tue, Mar 20, 2018 at 5:03 PM, Sebastian Nagel wrote: > Hi Robert, > > although

Re: Nutch 1.11 SSLHandshakeException

2018-03-20 Thread Sebastian Nagel
Hi Robert, although the error message differs, it somewhat resembles https://issues.apache.org/jira/browse/NUTCH-2447 I've tried to reproduce it using Nutch 1.11, but it works with Java 8 on Ubuntu 16.04. Sorry, I have no clue where even to start searching for the reason. Best, Sebastian On

Re: Nutch 1.11 SSLHandshakeException

2018-03-20 Thread Robert Scavilla
Thank you Sebastian! I am still working on the issue. I tested the cert using openssl and also got the same handshake failure. After further checking I found that the openssl command works when I add the -servername option. So apparently, my nutch server (Fedora 27) requires SNI. I added
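
For readers unfamiliar with SNI, the following Python sketch shows what the -servername option changes: the host name is sent in clear text inside the first handshake message (the ClientHello), and a client that omits it may be refused by servers that rely on it. The sketch uses in-memory buffers, so it makes no network connection. The function name client_hello is hypothetical, and the host name is the one from this thread, used only as sample input.

```python
import ssl


def client_hello(server_hostname=None):
    """Return the raw bytes of a TLS ClientHello, with or without SNI."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if server_hostname is None:
        # Hostname checking requires a server name, so switch it off
        # when we deliberately build a handshake without SNI.
        ctx.check_hostname = False
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the ClientHello has been written to `outgoing`, and the
        # handshake now waits for a server reply that never arrives.
        pass
    return outgoing.read()


with_sni = client_hello("potomac.edu")
without_sni = client_hello(None)
print(b"potomac.edu" in with_sni, b"potomac.edu" in without_sni)
```

Only the first ClientHello carries the server name, which matches the observation that the openssl test succeeds once -servername is added.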