Hi Jarcec,

Thanks for your help.
Found that the connection encoding was Latin-1; adding the following to my.cnf on the Hadoop server fixed the UTF-8 load issue:

[client]
default-character-set=utf8

Thanks,
Varun

________________________________
From: Jarek Jarcec Cecho <[email protected]>
To: [email protected]; varun kumar gullipalli <[email protected]>
Sent: Sunday, July 21, 2013 9:57 AM
Subject: Re: Sqoop - utf-8 data load issue

Hi Varun,
I've never had issues with Sqooping UTF data. Do you think that you can do a mysqldump of the table in question? If you could share it with the Sqoop version and exact command line, I would like to explore that a bit.

Jarcec

On Wed, Jul 17, 2013 at 05:42:01PM -0700, varun kumar gullipalli wrote:
> Thanks Jarcec.
> The Sqoop version is 1.4.2.
>
> I was verifying the QueryResult.java file that Sqoop generates; "type" is the
> name of the column that holds the multi-byte (UTF-8) data.
> Does declaring type as String work for multi-byte data?
>
> grep type QueryResult.java
> private String type;
> public String get_type() {
> return type;
> public void set_type(String type) {
> this.type = type;
> public QueryResult with_type(String type) {
> this.type = type;
> equal = equal && (this.type == null ?
> that.type == null : this.type.equals(that.type));
> this.type = JdbcWritableBridge.readString(5, __dbResults);
> JdbcWritableBridge.writeString(type, 5 + __off, 12, __dbStmt);
> this.type = null;
> this.type = Text.readString(__dataIn);
> if (null == this.type) {
> Text.writeString(__dataOut, type);
> __sb.append(FieldFormatter.escapeAndEnclose(type==null?"\\N":type, delimiters));
> if (__cur_str.equals("null")) { this.type = null; } else {
> this.type = __cur_str;
> __sqoop$field_map.put("type", this.type);
> else if ("type".equals(__fieldName)) {
> this.type = (String) __fieldVal;
>
> ________________________________
> From: Jarek Jarcec Cecho <[email protected]>
> To: [email protected]; varun kumar gullipalli <[email protected]>
> Sent: Wednesday, July 17, 2013 8:36 AM
> Subject: Re: Sqoop - utf-8 data load issue
>
> Thank you Varun,
> the sequence c3 83 c2 a9 indeed does not correspond to a correct character. I was
> able to google out one entry on Stack Overflow [1] that might somehow be relevant to
> your issue. I've tried to reproduce this on my cluster, but I was not
> able to. Do you think that you can do a mysqldump of the table in question? If
> you could share it with the Sqoop version and exact command line, I would like
> to explore that a bit.
>
> Jarcec
>
> Links:
> 1: http://stackoverflow.com/questions/8499852/xmldocument-mis-reads-utf-8-e-acute-character
>
> On Tue, Jul 16, 2013 at 04:24:49PM -0700, varun kumar gullipalli wrote:
> > Here is the output Jarcec...
> >
> > ________________________________
> > From: Jarek Jarcec Cecho <[email protected]>
> > To: [email protected]; varun kumar gullipalli <[email protected]>
> > Sent: Tuesday, July 16, 2013 11:05 AM
> > Subject: Re: Sqoop - utf-8 data load issue
> >
> > Thank you for the additional information Varun! Would you mind doing
> > something like the following:
> >
> > hadoop dfs -text THE_FILE | hexdump -C
> >
> > And sharing the output?
> > I'm trying to see the actual content of the file
> > rather than any interpreted value.
> >
> > Jarcec
> >
> > On Mon, Jul 15, 2013 at 06:52:11PM -0700, varun kumar gullipalli wrote:
> > > Hi Jarcec,
> > >
> > > I am validating the data by running the following command:
> > >
> > > hadoop fs -text <hdfs cluster>
> > >
> > > I think there is no issue with the shell (correct me if I am wrong) because
> > > I am connecting to the MySQL database from the same shell (command line) and
> > > can view the source data properly.
> > >
> > > Initially we observed that the following conf files don't have the UTF-8
> > > encoding declaration:
> > > <?xml version="1.0" encoding="UTF-8"?>
> > >
> > > sqoop-site.xml
> > > sqoop-site-template.xml
> > >
> > > But no luck after making those changes either.
> > >
> > > Thanks,
> > > Varun
> > >
> > > ________________________________
> > > From: Jarek Jarcec Cecho <[email protected]>
> > > To: [email protected]; varun kumar gullipalli <[email protected]>
> > > Sent: Monday, July 15, 2013 6:37 PM
> > > Subject: Re: Sqoop - utf-8 data load issue
> > >
> > > Hi Varun,
> > > we don't usually see any issues with transferring text data in UTF. How are
> > > you validating the imported file? I can imagine that your shell might be
> > > messing up the encoding.
> > >
> > > Jarcec
> > >
> > > On Mon, Jul 15, 2013 at 06:27:25PM -0700, varun kumar gullipalli wrote:
> > > > Hi,
> > > > I am importing data from MySQL to HDFS using a free-form query import.
> > > > It works fine, but I am facing an issue when the data is UTF-8. The source (MySQL)
> > > > db is UTF-8 compatible, but it looks like Sqoop is converting the data
> > > > during import.
> > > > Example - the source value elémeñt is loaded as elÃ©meÃ±t to HDFS.
> > > > Please provide a solution for this.
> > > > Thanks in advance!
> >
> > 00000000  31 32 33 34 35 36 37 38 39 30 07 31 33 37 33 32  |1234567890.13732|
> > 00000010  36 30 33 34 36 31 35 31 07 31 33 37 33 32 36 30  |60346151.1373260|
> > 00000020  33 34 36 31 35 31 07 30 07 65 6c c3 83 c2 a9 6d  |346151.0.el....m|
> > 00000030  65 c3 83 c2 b1 74 07 c3 a8 c2 b4 c2 bc c3 a2 e2  |e....t..........|
> > 00000040  80 9a c2 ac c3 ac e2 80 9a c2 ac c3 ac e2 80 93  |................|
> > 00000050  c2 b4 c3 a8 e2 80 b0 c2 be c3 a8 c2 a5 c2 bf 0a  |................|
> > 00000060
>
> Here is a sample command line:
>
> sqoop --options-file $CONN_FILE --lines-terminated-by '\n' --verbose
> --query "<<QUERY>>' and \$CONDITIONS" -m 1 --target-dir
> $YYYY/$MM/$DD/${TBL_NAME} --null-string '\\N' --null-non-string '\\N' >> $LOGFILE 2>&1
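[Editor's note: the byte pattern in the hexdump above (65 6c c3 83 c2 a9 6d 65 c3 83 c2 b1 74 where 65 6c c3 a9 6d 65 c3 b1 74 was expected) is the classic double-encoding signature: UTF-8 bytes read over a latin1 connection and then re-encoded as UTF-8. A minimal sketch reproducing it follows; this is not part of the original thread, and the class name is illustrative.]

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String source = "el\u00e9me\u00f1t"; // "elémeñt", as stored in MySQL

        // MySQL stores the row as UTF-8 bytes:
        // 65 6c c3 a9 6d 65 c3 b1 74  ("el", é = c3 a9, "me", ñ = c3 b1, "t")
        byte[] stored = source.getBytes(StandardCharsets.UTF_8);

        // A latin1 client connection hands those bytes over as if each
        // byte were a single Latin-1 character:
        String misread = new String(stored, StandardCharsets.ISO_8859_1);
        // misread is now "elÃ©meÃ±t"

        // Sqoop then writes the misread string to HDFS as UTF-8, which
        // re-encodes each mojibake character into two bytes:
        byte[] onDisk = misread.getBytes(StandardCharsets.UTF_8);
        // 65 6c c3 83 c2 a9 6d 65 c3 83 c2 b1 74 -- exactly the
        // c3 83 c2 a9 / c3 83 c2 b1 sequences seen in the hexdump above.

        System.out.println(misread);
    }
}
```

Setting default-character-set=utf8 under [client], as in the fix at the top of the thread, makes the connection decode the stored bytes as UTF-8 in the first place, so the misread step never happens.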
