Re: addBinaryContent and string length must be a multiple of four
Hi Michael,

I tried to reproduce the problem with the current Nutch master and Solr 6.6.0,
without success: indexing the binary content succeeded.
- that's the case for two of the URLs you sent
- those from buzz.money.cnn.com are blocked somehow (fetching failed)

Building Nutch isn't difficult:

    git clone http://github.com/apache/nutch.git
    cd nutch
    ant

You'll find the Nutch runtime in runtime/local/ or runtime/deploy/ (for usage
on Hadoop).

The tutorial
  https://wiki.apache.org/nutch/NutchTutorial
should already be up to date on how to use recent Solr versions.

Best,
Sebastian

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"id:http\\://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
      "indent":"on",
      "wt":"json",
      "_":"1508829081797"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "date":"2017-10-24T07:01:05.593Z",
        "author":"Matt Egan",
        "title":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, 2017",
        "type":["application/xhtml+xml",
          "application",
          "xhtml+xml"],
        "url":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "content":"Trump adviser Carl Icahn bets against the Trump rally - Mar. 7, ...",
        "tstamp":"2017-10-24T07:01:05.593Z",
        "segment":"20171024090054",
        "digest":"cff265f11bd74bd104f3c6e1c7185484",
        "boost":1.0,
        "id":"http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html",
        "_version_":1582121409782480896,
        "binaryContent":"+IDxzY3JpcHQgdHlwZT0idGV4dC9qYXZhc2NyaXB0Ij4gdmFyIHVybFByZT0iaHR0cDovL21hcmtld..."}]
  }}

On 10/24/2017 01:07 AM, Michael Coffey wrote:
> http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html
>
> http://buzz.money.cnn.com/author/ctymkiw/
>
> http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448
>
> http://buzz.money.cnn.com/tag/investing/
>
> Meanwhile, the following URL also gets an "error adding field" message but
> with "msg=Illegal character" instead of "String length must be a multiple of
> four". Don't know if it's related.
>
> http://buzz.money.cnn.com/author/byheatherlong/
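For anyone who wants to turn a stored binaryContent value from a response like the one above back into raw HTML, here is a minimal Python sketch. It assumes the field holds standard base64 text, as it should when indexing with -addBinaryContent -base64. The sample document and the function name decode_binary_content are made up for illustration and are not part of Nutch or Solr.

```python
import base64
import json


def decode_binary_content(doc):
    """Return the raw bytes stored in a document's binaryContent field."""
    encoded = doc["binaryContent"]
    # validate=True makes the decoder reject characters outside the base64
    # alphabet instead of silently skipping them.
    return base64.b64decode(encoded, validate=True)


# Hypothetical response, shaped like the Solr JSON output quoted in this thread.
sample_response = json.dumps({
    "response": {"numFound": 1, "start": 0, "docs": [{
        "id": "http://example.com/index.html",
        "binaryContent": base64.b64encode(
            b"<html><body>hello</body></html>").decode("ascii"),
    }]}
})

docs = json.loads(sample_response)["response"]["docs"]
raw_html = decode_binary_content(docs[0])
print(raw_html.decode("utf-8"))
```

An external parser can then work on raw_html directly. The character encoding of the page is not recorded in this sketch, so decoding as UTF-8 is an assumption too.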
Re: addBinaryContent and string length must be a multiple of four
Thanks for the reply!

I'm not sure of the best way to illustrate the issue, as I struggle with Solr
log management within Docker. However, here are a few URLs that have exhibited
the problem. In each case, Solr complains "Error adding field 'binaryContent'"
... "msg=String length must be a multiple of four"

http://cnnfn.cnn.com/2017/03/07/investing/carl-icahn-betting-against-trump-rally/index.html

http://buzz.money.cnn.com/author/ctymkiw/

http://abcnews.go.com/GMA/video/rose-mcgowan-dropped-agent-calling-sexist-casting-note-32047448

http://buzz.money.cnn.com/tag/investing/

Meanwhile, the following URL also gets an "error adding field" message but
with "msg=Illegal character" instead of "String length must be a multiple of
four". Don't know if it's related.

http://buzz.money.cnn.com/author/byheatherlong/

All tests were done with Nutch 1.12 and Solr 5.4.1.

BTW, I wouldn't mind updating Nutch and Solr. What is your recommended
most-stable combination of versions? I am using Hadoop 2.7.3 (from
Hortonworks).

At one point, Lewis John McG reported on such an issue in
https://issues.apache.org/jira/browse/NUTCH-2186
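Both messages read like complaints from a strict base64 decoder: one about the length of the text, the other about a character outside the alphabet. A small check along the following lines may help tell the two cases apart on the client side before a document is sent. This is only a sketch: it assumes the standard base64 alphabet with = padding, the name check_base64 is hypothetical, and it is not Solr's own validation code.

```python
import re

# Standard base64 alphabet; '=' is only allowed as trailing padding.
_BASE64_TEXT = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def check_base64(text):
    """Return None if text looks like padded base64, else a short reason."""
    if len(text) % 4 != 0:
        return "length %d is not a multiple of four" % len(text)
    if not _BASE64_TEXT.fullmatch(text):
        return "illegal character or misplaced padding"
    return None


# One well-formed value, one with the padding cut off, one with a stray '*'.
for candidate in ["aGVsbG8=", "aGVsbG8", "aGVs*G8="]:
    print(repr(candidate), "->", check_base64(candidate) or "ok")
```

Running such a check over the values that fail would show whether the field content is truncated or unpadded base64, or not base64 at all.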
Re: addBinaryContent and string length must be a multiple of four
Hi Michael,

can you share more information regarding the Nutch and Solr versions, and at
least one document to make the problem reproducible?

Looks like that's not a general problem - at least, I'm not able to reproduce
it: indexing with -addBinaryContent -base64 succeeds (recent Nutch snapshot /
master, Solr 6.6.0).

Thanks,
Sebastian

On 10/20/2017 06:46 PM, Michael Coffey wrote:
> I guess there is no solution or workaround for the addBinaryContent bug, so I
> have to write code to read directly from segment data. If not writing Java, I
> guess I have to do readseg-dump and then parse the output text file.
>
>
> -- original message --
> I think I have an instance of the known bug
> https://issues.apache.org/jira/browse/NUTCH-2186
>
> I need to keep raw html in my Solr index (or somewhere) so that an external
> tool can access it and parse it. So, I added addBinaryContent and base64 to
> my indexing command. On the very first segment, I get a bunch of failures
> with messages that say "String length must be a multiple of four." The same
> is true if I omit the base64 argument.
>
> Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr
> 5.4.1.
>
Re: addBinaryContent and string length must be a multiple of four
I guess there is no solution or workaround for the addBinaryContent bug, so I
have to write code to read directly from segment data. If not writing Java, I
guess I have to do readseg -dump and then parse the output text file.

-- original message --
I think I have an instance of the known bug
https://issues.apache.org/jira/browse/NUTCH-2186

I need to keep raw HTML in my Solr index (or somewhere) so that an external
tool can access it and parse it. So, I added addBinaryContent and base64 to my
indexing command. On the very first segment, I get a bunch of failures with
messages that say "String length must be a multiple of four." The same is true
if I omit the base64 argument.

Is there a workaround or fix for this issue? I am using Nutch 1.12 and Solr
5.4.1.