Thank you for sharing your final solution, Varun!

Jarcec
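For readers skimming the thread: the root cause below turned out to be the MySQL client connection negotiating Latin 1. One way to confirm which character set a connection actually negotiates is to list the character_set_% session variables. A minimal JDBC sketch, not from the thread; the host, database, and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ConnCharsetCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and credentials -- substitute your own.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://dbhost/mydb", "user", "password");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SHOW VARIABLES LIKE 'character_set_%'")) {
                while (rs.next()) {
                    // character_set_client, character_set_connection and
                    // character_set_results should all report utf8; a latin1
                    // value here matches the double encoding seen in this thread.
                    System.out.println(rs.getString(1) + " = " + rs.getString(2));
                }
            }
        }
    }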
On Mon, Jul 22, 2013 at 09:30:25PM -0700, varun kumar gullipalli wrote:
> Hi Jarcec,
>
> Thanks for your help.
>
> Found that the connection encoding is Latin 1 and adding the following to
> my.cnf on the hadoop server fixed the utf8 load issue.
>
> [client]
> default-character-set=utf8
>
> Thanks,
> Varun
>
> ________________________________
> From: Jarek Jarcec Cecho <[email protected]>
> To: [email protected]; varun kumar gullipalli <[email protected]>
> Sent: Sunday, July 21, 2013 9:57 AM
> Subject: Re: Sqoop - utf-8 data load issue
>
> Hi Varun,
> I've never had issues with Sqooping UTF data. Do you think that you can do
> a mysqldump of the table in question? If you could share it with the Sqoop
> version and the exact command line, I would like to explore that a bit.
>
> Jarcec
>
> On Wed, Jul 17, 2013 at 05:42:01PM -0700, varun kumar gullipalli wrote:
> > Thanks Jarcec.
> > The sqoop version is 1.4.2.
> >
> > I was verifying the QueryResult.java file that Sqoop creates; type is the
> > column name which has multi-byte (utf-8) data.
> > Does declaring type as String work for multi-byte data?
> >
> > grep type QueryResult.java
> > private String type;
> > public String get_type() {
> > return type;
> > public void set_type(String type) {
> > this.type = type;
> > public QueryResult with_type(String type) {
> > this.type = type;
> > equal = equal && (this.type == null ? that.type == null : this.type.equals(that.type));
> > this.type = JdbcWritableBridge.readString(5, __dbResults);
> > JdbcWritableBridge.writeString(type, 5 + __off, 12, __dbStmt);
> > this.type = null;
> > this.type = Text.readString(__dataIn);
> > if (null == this.type) {
> > Text.writeString(__dataOut, type);
> > __sb.append(FieldFormatter.escapeAndEnclose(type==null?"\\N":type, delimiters));
> > if (__cur_str.equals("null")) { this.type = null; } else {
> > this.type = __cur_str;
> > __sqoop$field_map.put("type", this.type);
> > else if ("type".equals(__fieldName)) {
> > this.type = (String) __fieldVal;
> >
> > ________________________________
> > From: Jarek Jarcec Cecho <[email protected]>
> > To: [email protected]; varun kumar gullipalli <[email protected]>
> > Sent: Wednesday, July 17, 2013 8:36 AM
> > Subject: Re: Sqoop - utf-8 data load issue
> >
> > Thank you Varun,
> > the sequence c3 83 c2 a9 indeed does not correspond to a correct character. I
> > was able to google up one entry on Stack Overflow [1] that might be
> > relevant to your issue somehow. I've tried to reproduce this on my cluster,
> > but I was not able to. Do you think that you can do a mysqldump of the table
> > in question? If you could share it with the Sqoop version and the exact
> > command line, I would like to explore that a bit.
> >
> > Jarcec
> >
> > Links:
> > 1: http://stackoverflow.com/questions/8499852/xmldocument-mis-reads-utf-8-e-acute-character
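A note on the byte sequence discussed above: c3 83 c2 a9 is the classic UTF-8 -> Latin 1 -> UTF-8 double encoding. "é" is c3 a9 in UTF-8; if those two bytes are decoded as Latin 1 (giving "Ã©") and re-encoded as UTF-8, the result is exactly c3 83 c2 a9. A minimal sketch reproducing this, not from the thread:

    import java.nio.charset.StandardCharsets;

    public class DoubleEncodingDemo {
        public static void main(String[] args) {
            // "\u00e9" is é; the escape keeps the demo independent of the
            // source file's own encoding.
            byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);         // c3 a9
            String misread = new String(utf8, StandardCharsets.ISO_8859_1);  // "Ã©"
            byte[] doubled = misread.getBytes(StandardCharsets.UTF_8);
            for (byte b : doubled) {
                System.out.printf("%02x ", b & 0xff);  // prints: c3 83 c2 a9
            }
            System.out.println();
        }
    }

The same arithmetic explains the hexdump further down: the ASCII bytes 65 6c ("el") survive intact, while each accented character has ballooned into a four-byte mojibake sequence.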
> > On Tue, Jul 16, 2013 at 04:24:49PM -0700, varun kumar gullipalli wrote:
> > > Here is the output Jarcec...
> > >
> > > ________________________________
> > > From: Jarek Jarcec Cecho <[email protected]>
> > > To: [email protected]; varun kumar gullipalli <[email protected]>
> > > Sent: Tuesday, July 16, 2013 11:05 AM
> > > Subject: Re: Sqoop - utf-8 data load issue
> > >
> > > Thank you for the additional information Varun! Would you mind doing
> > > something like the following:
> > >
> > > hadoop dfs -text THE_FILE | hexdump -C
> > >
> > > And sharing the output? I'm trying to see the actual content of the file
> > > rather than any interpreted value.
> > >
> > > Jarcec
> > >
> > > On Mon, Jul 15, 2013 at 06:52:11PM -0700, varun kumar gullipalli wrote:
> > > > Hi Jarcec,
> > > >
> > > > I am validating the data by running the following command:
> > > >
> > > > hadoop fs -text <hdfs cluster>
> > > >
> > > > I think there is no issue with the shell (correct me if I am wrong)
> > > > because I am connecting to the MySQL database from the same shell (command
> > > > line) and can view the source data properly.
> > > >
> > > > Initially we observed that the following conf files don't have the UTF-8
> > > > encoding declaration:
> > > >
> > > > <?xml version="1.0" encoding="UTF-8"?>
> > > >
> > > > sqoop-site.xml
> > > > sqoop-site-template.xml
> > > >
> > > > But no luck after making those changes either.
> > > >
> > > > Thanks,
> > > > Varun
> > > >
> > > > ________________________________
> > > > From: Jarek Jarcec Cecho <[email protected]>
> > > > To: [email protected]; varun kumar gullipalli <[email protected]>
> > > > Sent: Monday, July 15, 2013 6:37 PM
> > > > Subject: Re: Sqoop - utf-8 data load issue
> > > >
> > > > Hi Varun,
> > > > we usually don't see any issues with transferring text data in UTF.
> > > > How are you validating the imported file? I can imagine that your
> > > > shell might be messing up the encoding.
> > > >
> > > > Jarcec
> > > >
> > > > On Mon, Jul 15, 2013 at 06:27:25PM -0700, varun kumar gullipalli wrote:
> > > > > Hi,
> > > > > I am importing data from MySQL to HDFS using a free-form query import.
> > > > > It works fine, but I am facing an issue when the data is UTF-8. The
> > > > > source (MySQL) db is UTF-8 compatible, but it looks like Sqoop is
> > > > > converting the data during import.
> > > > > Example: the source value elémeñt is loaded as elÃ©meÃ±t to HDFS.
> > > > > Please provide a solution for this.
> > > > > Thanks in advance!
> > >
> > > 00000000  31 32 33 34 35 36 37 38  39 30 07 31 33 37 33 32  |1234567890.13732|
> > > 00000010  36 30 33 34 36 31 35 31  07 31 33 37 33 32 36 30  |60346151.1373260|
> > > 00000020  33 34 36 31 35 31 07 30  07 65 6c c3 83 c2 a9 6d  |346151.0.el....m|
> > > 00000030  65 c3 83 c2 b1 74 07 c3  a8 c2 b4 c2 bc c3 a2 e2  |e....t..........|
> > > 00000040  80 9a c2 ac c3 ac e2 80  9a c2 ac c3 ac e2 80 93  |................|
> > > 00000050  c2 b4 c3 a8 e2 80 b0 c2  be c3 a8 c2 a5 c2 bf 0a  |................|
> > > 00000060
> >
> > Here is a sample command line:
> >
> > sqoop --options-file $CONN_FILE --lines-terminated-by '\n' --verbose \
> >   --query "<<QUERY>>' and \$CONDITIONS" -m 1 \
> >   --target-dir $YYYY/$MM/$DD/${TBL_NAME} \
> >   --null-string '\\N' --null-non-string '\\N' >> $LOGFILE 2>&1
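For completeness, a per-connection alternative to the system-wide my.cnf [client] fix: MySQL Connector/J accepts useUnicode and characterEncoding as JDBC URL properties, so the connect string in the options file can request UTF-8 directly. A sketch of the relevant options-file lines; the host and database names are placeholders:

    # Inside $CONN_FILE (Sqoop options files take one option per line; '#' starts a comment)
    import
    --connect
    jdbc:mysql://dbhost/mydb?useUnicode=true&characterEncoding=UTF-8

Either approach addresses the same root cause: the MySQL client connection defaulting to Latin 1.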
