RE: invalid utf8 chars when indexing or cleaning

2017-08-31 Thread Markus Jelsma
The bug is identical, but I fixed it! You should verify the output Nutch generates and inspect it manually; there should be a 0x at that byte. If it really is there, we need to check the fix once more, although I am sure the patch works as intended. Get the XML, pass it through the
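
For the manual inspection described above, a small script can report the byte offsets worth looking at. This is only a sketch and not part of Nutch: the function name find_invalid_xml_chars is made up for illustration, and it assumes the XML has already been captured as raw bytes. It flags the first byte that is not valid UTF-8, or otherwise every character outside the XML 1.0 Char range.

```python
import re

# Characters NOT allowed by the XML 1.0 "Char" production:
# anything other than tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(data: bytes):
    """Return a list of (byte_offset, description) for suspicious input.

    Hypothetical helper for manual inspection; it is not a Nutch API.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Strict decoding stops at the first malformed sequence.
        bad = data[exc.start]
        return [(exc.start, "invalid UTF-8 byte 0x%02X" % bad)]

    problems = []
    for match in _INVALID_XML_CHAR.finditer(text):
        # Convert the character index back to a byte offset in the original data.
        byte_offset = len(text[:match.start()].encode("utf-8"))
        code_point = ord(match.group())
        problems.append(
            (byte_offset, "U+%04X is not allowed in XML 1.0" % code_point)
        )
    return problems


if __name__ == "__main__":
    # Illustrative input only; real data would be the captured XML.
    sample = b"<doc>ok\x00bad</doc>"
    for offset, reason in find_invalid_xml_chars(sample):
        print(offset, reason)
```

The reported offsets can then be compared with the byte position named in the indexing error.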

Re: invalid utf8 chars when indexing or cleaning

2017-08-31 Thread Michael Coffey
It sounds like a good suggestion, but I don't know what you mean by "verify the output Nutch generates and inspect it manually." How do I get a look at that XML?

From:
To: "user@nutch.apache.org"
Sent: Thursday, August 31, 2017 11:59 AM
Subject: RE: invalid