Hi Michael,

I tried to reproduce the problem with the current Nutch master and Solr 6.6.0 but could not: indexing the binary content succeeded.
- that's the case for two of the URLs you sent
- those from buzz.money.cnn.com are blocked somehow (fetching failed)
Building Nutch isn't difficult:

  git clone http://github.com/apache/nutch.git
  cd nutch
  ant

You'll find the Nutch runtime in runtime/local/ or runtime/deploy/ (for usage on Hadoop). The tutorial
  https://wiki.apache.org/nutch/NutchTutorial
should already be up to date on how to use recent Solr versions.

Best,
Sebastian

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"id:http\\://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
      "indent":"on",
      "wt":"json",
      "_":"1508829081797"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "date":"2017-10-24T07:01:05.593Z",
        "author":"Matt Egan",
        "title":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, 2017",
        "type":["application/xhtml+xml",
          "application",
          "xhtml+xml"],
        "url":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "content":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, ...",
        "tstamp":"2017-10-24T07:01:05.593Z",
        "segment":"20171024090054",
        "digest":"cff265f11bd74bd104f3c6e1c7185484",
        "boost":1.0,
        "id":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "_version_":1582121409782480896,
        "binaryContent":"+IDxzY3JpcHQgdHlwZT0idGV4dC9qYXZhc2NyaXB0Ij4gdmFyIHVybFByZT0iaHR0cDovL21hcmtld..."}]
  }}

On 10/24/2017 01:07 AM, Michael Coffey wrote:
> http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html
>
>
> http://buzz.money.cnn.com/author/ctymkiw/
>
> http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448
>
> http://buzz.money.cnn.com/tag/investing/
>
> Meanwhile, the following URL also gets an "error adding field" message but
> with "msg=Illegal character" instead of "String length must be a multiple of
> four". Don't know if it's related.
>
> http://buzz.money.cnn.com/author/byheatherlong/
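
The two error messages quoted above ("String length must be a multiple of four" and "Illegal character") read like failures to decode Base64, although that is an assumption on my side and not something confirmed in this thread. If it helps with debugging, here is a minimal Python sketch for checking a candidate binaryContent value locally before it is sent to Solr. The function name check_base64 is hypothetical; it is not part of Nutch or Solr, and the returned strings are my own wording, not Solr's messages.

```python
import base64
import binascii
import re

# Standard Base64 alphabet with up to two '=' padding characters at the end.
_B64_PATTERN = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def check_base64(value: str) -> str:
    """Return 'ok', or a short reason why `value` is not standard Base64."""
    # Any character outside the alphabet (or misplaced padding) fails here.
    if _B64_PATTERN.fullmatch(value) is None:
        return "illegal character"
    # Padded Base64 text always has a length divisible by four.
    if len(value) % 4 != 0:
        return "length not a multiple of four"
    try:
        # validate=True makes the decoder strict instead of skipping bad input.
        base64.b64decode(value, validate=True)
    except binascii.Error as exc:
        return f"decode error: {exc}"
    return "ok"


if __name__ == "__main__":
    print(check_base64("QUJD"))  # "ABC" encoded, prints: ok
    print(check_base64("QUJ"))   # prints: length not a multiple of four
    print(check_base64("QU-D"))  # prints: illegal character
```

Running such a check on the value for one failing URL and one working URL should show whether the field content itself is malformed or whether the problem lies elsewhere.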

