Okay, sounds like you may actually need it. I've updated the information at http://nlp.solutions.asia/?p=180 to use utf8mb4. Could you try that with MySQL 5.5 or above and see if it helps? The change is in three places: the db server config, the db creation and the table creation.
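
For what it's worth, here is a quick way to check whether a string needs the 4-byte encoding. This is only a rough sketch in Python, not anything from Nutch or MySQL itself; the helper name needs_utf8mb4 is made up for illustration, and the bytes are the ones from your message. It decodes the bytes as UTF-8 and tests whether any character lies outside the Basic Multilingual Plane, which is the range MySQL's 3-byte utf8 charset is limited to.

```python
# Bytes quoted in the error report (assumed to be valid UTF-8).
raw = b"\xF0\x9F\x92\x83"


def needs_utf8mb4(text: str) -> bool:
    """Hypothetical helper: True if any character is above U+FFFF.

    Such characters take 4 bytes in UTF-8, so they do not fit in
    MySQL's 3-byte utf8 charset and need utf8mb4 instead.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


text = raw.decode("utf-8")          # a single character
code_point = f"U+{ord(text):04X}"   # its Unicode code point

print(code_point, len(raw), needs_utf8mb4(text))
```

If that prints True for the offending string, the column, table, database and connection all have to be utf8mb4 for the insert to go through.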
-----Original Message-----
From: sumarlidason [mailto:[email protected]]
Sent: Wednesday, October 24, 2012 9:31 AM
To: [email protected]
Subject: RE: nutch/hadoop/solr

Actually, that is the tutorial I followed. I'm still getting these errors. This string, \xF0\x9F\x92\x83, is actually this character: 💃. I assume that's where the issue is. However, I am unable to reproduce the error when manually inserting via /usr/bin/mysql.

I read this article, http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/, and he suggests that utf8_bin might resolve the issue. Other forums suggest that even though the default charset is set, the column charset has to be specifically set as well. I can't get past the fact that MySQL pre-5.5 only stores 1-3 byte UTF-8 characters instead of 1-4 byte ones.

j.sullivan wrote
> Sumarlidason
>
> Hi
>
> The need to use utf8mb4 for web crawling should be fairly rare. If you
> are using MySQL 5.5 or later and have a set up like this
> http://nlp.solutions.asia/?p=180 you should be fine.
>
> James

--
View this message in context: http://lucene.472066.n3.nabble.com/nutch-hadoop-solr-tp4014761p4015480.html
Sent from the Nutch - User mailing list archive at Nabble.com.

