Re: Nutch NTLM to IIS 8.5 - issues!

2019-04-26 Thread Michael Portnoy
Nutch 1.14 uses HttpClient 3.x, which does not work with NTLMv2. Not sure if that's your case. To get auth to work, we had to migrate the httpclient plugin to HttpClient 4.x. This may have been done in Nutch 1.15. On Fri., Apr. 26, 2019, 10:24 a.m. Larry.Santello, wrote: > Been reading a

Re: [VOTE] Release Apache Nutch 1.16 RC#1

2019-10-02 Thread Michael Portnoy
+1 On Wed., Oct. 2, 2019, 1:55 p.m. Sebastian Nagel, wrote: > Hi Folks, > > A first candidate for the Nutch 1.16 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/1.16/ > > The release candidate is a zip and tar.gz archive of the binary and > sources in: > https://g

protocol-selenium plug-in incompatible with downstream plugins

2017-10-25 Thread Michael Portnoy
The pages that I'm crawling are dynamically generated (i.e. using JavaScript), so I am using the `protocol-selenium` plugin instead of `protocol-http`, as per https://wiki.apache.org/nutch/AdvancedAjaxInteraction. Problem: protocol-selenium is using lib-selenium which unlike protocol

indexer-solr is failing to de-duplicate URL encoded URLs

2018-03-07 Thread Michael Portnoy
for " + key); } catch (IllegalArgumentException e) { LOG.warn("Could not decode: " + key + ", it probably wasn't encoded in the first place.."); } Commenting out the above resolves the issue, but I don't understand why this workaround was added in the first place.
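
To illustrate why decoding a key could interfere with matching, here is a minimal sketch, not the actual indexer-solr code. The class `KeyDecodeSketch`, the helper `decodeOrKeep`, and the example URLs are hypothetical; only `java.net.URLDecoder.decode` is a real API. The assumption is that a document is stored under its percent-encoded URL while a later operation uses the decoded form of the same key, so the two strings no longer match.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class KeyDecodeSketch {

    // Hypothetical helper: try to URL-decode a key and fall back to the
    // original string when decoding fails, similar in shape to the
    // try/catch fragment quoted above.
    static String decodeOrKeep(String key) {
        try {
            return URLDecoder.decode(key, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // Malformed escapes such as "%zz" raise IllegalArgumentException.
            return key;
        }
    }

    public static void main(String[] args) {
        // Assumed example: the id under which a page was indexed.
        String indexedId = "http://example.com/a%20b";
        // The id a later operation would use after decoding the same key.
        String decodedId = decodeOrKeep(indexedId);

        System.out.println("indexed: " + indexedId);
        System.out.println("decoded: " + decodedId);
        System.out.println("same id: " + indexedId.equals(decodedId));
    }
}
```

If the stored id and the decoded id differ like this, an operation keyed on the decoded string would not find the stored document, which would be consistent with the report that commenting out the decode step resolves the issue.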