Yes, NUTCH-1016 already fixed this problem. The property "parser.character.encoding.default" is used when EncodingDetector cannot detect the content encoding; in that case it sets the default encoding for the page content. If this detection is wrong, the parsed content can sometimes come out as unreadable text, as in [0].
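To illustrate how a wrong fallback encoding produces unreadable text, here is a minimal Java sketch. It is not Nutch's actual code: the class EncodingFallbackDemo and its decode method are hypothetical, and a null "detected" value simply stands in for the case where no encoding could be detected.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is NOT Nutch's EncodingDetector.
public class EncodingFallbackDemo {

    // Decode the raw bytes with the detected charset, or with the
    // configured default when nothing was detected (detected == null).
    static String decode(byte[] content, String detected, String defaultEncoding) {
        String name = (detected != null) ? detected : defaultEncoding;
        return new String(content, Charset.forName(name));
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the e.
        // In UTF-8 that last character is the two bytes 0xC3 0xA9.
        byte[] utf8Page = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Detection succeeded: the text round-trips unchanged.
        String good = decode(utf8Page, "UTF-8", "windows-1252");

        // Detection failed: the windows-1252 default turns each of the
        // two bytes into its own character, so the text is garbled.
        String bad = decode(utf8Page, null, "windows-1252");

        System.out.println(good.length()); // 4 characters
        System.out.println(bad.length());  // 5 characters
    }
}
```

The garbled string is one character longer because each byte of the UTF-8 pair is decoded separately. In this sketch the fix is just to pass a default that matches the pages being decoded.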
[0] http://mail-archives.apache.org/mod_mbox/nutch-user/201303.mbox/%3ccaoewmmp7ngle6otgmbepb450cmobc3w0xjk6ohs1raff_5q...@mail.gmail.com%3E

On Mon, Mar 18, 2013 at 10:31 AM, neeraj <[email protected]> wrote:

> Amuseme,
>
> Thanks for the reply. I reviewed the exceptions given on the link and I
> am not getting any of those. I have more than 5 million documents crawled
> and was able to index 120 K documents to Solr before this exception
> occurred for invalid XML character.
>
> I was trying to investigate around this issue and found that there are
> previous posts on the same topic where the patch was being applied to
> stripNonCharCodepoints(). But that is already part of Nutch 1.6 and I am
> still getting the same exception.
>
> My "parser.character.encoding.default" was set to windows-1252 when
> crawling all these documents. Could that have let to this exception when
> indexing?
>
> Any insight on this will be helpful.
>
> Thanks,
> Neeraj.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-6-Need-help-with-Indexing-tp4048290p4048391.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)

