Thank you for sharing your final solution Varun!

Jarcec

On Mon, Jul 22, 2013 at 09:30:25PM -0700, varun kumar gullipalli wrote:
> Hi Jarcec,
> 
> Thanks for your help. 
> 
> Found that the connection encoding is Latin 1 and adding the following to 
> my.cnf on the hadoop server fixed the utf8 load issue.
> [client]
> default-character-set=utf8
> 
> 
> Thanks,
> Varun
> 
> 
> ________________________________
>  From: Jarek Jarcec Cecho <[email protected]>
> To: [email protected]; varun kumar gullipalli <[email protected]> 
> Sent: Sunday, July 21, 2013 9:57 AM
> Subject: Re: Sqoop - utf-8 data load issue
>  
> 
> Hi Varun,
> I've never had issues with Sqooping UTF data. Do you think that you can do 
> mysqldump of the table in question?  If you could share it with the Sqoop 
> version and exact command line I would like to explore that a bit.
> 
> Jarcec
> 
> On Wed, Jul 17, 2013 at 05:42:01PM -0700, varun kumar gullipalli wrote:
> > Thanks Jarcec.
> > sqoop version is 1.4.2
> >  
> >  
> > I was verifying the QueryResult.java file that sqoop creates; type is the 
> > column name which has multi-byte data(utf-8). 
> > Does declaring type as string work for multi-byte data?
> >  
> > grep type QueryResult.java
> >   private String type;
> >   public String get_type() {
> >     return type;
> >   public void set_type(String type) {
> >     this.type = type;
> >   public QueryResult with_type(String type) {
> >     this.type = type;
> >     equal = equal && (this.type == null ? that.type == null : 
> > this.type.equals(that.type));
> >     this.type = JdbcWritableBridge.readString(5, __dbResults);
> >     JdbcWritableBridge.writeString(type, 5 + __off, 12, __dbStmt);
> >         this.type = null;
> >     this.type = Text.readString(__dataIn);
> >     if (null == this.type) {
> >     Text.writeString(__dataOut, type);
> >     __sb.append(FieldFormatter.escapeAndEnclose(type==null?"\\N":type, 
> > delimiters));
> >     if (__cur_str.equals("null")) { this.type = null; } else {
> >       this.type = __cur_str;
> >     __sqoop$field_map.put("type", this.type);
> >     else    if ("type".equals(__fieldName)) {
> >       this.type = (String) __fieldVal;
> > 
> >   
> > 
> > 
> > 
> > ________________________________
> > From: Jarek Jarcec Cecho <[email protected]>
> > To: [email protected]; varun kumar gullipalli <[email protected]> 
> > Sent: Wednesday, July 17, 2013 8:36 AM
> > Subject: Re: Sqoop - utf-8 data load issue
> > 
> > 
> > Thank you Varun,
> > the sequence c3 83 c2 a9 indeed does not correspond to the correct 
> > character. I was able to find one entry on Stack Overflow [1] that might be 
> > relevant to your issue. I've tried to reproduce this on my cluster, 
> > but I was not able to. Do you think that you can do mysqldump of the table 
> > in question?  If you could share it with the Sqoop version and exact 
> > command line I would like to explore that a bit.
> > 
> > Jarcec
> > 
> > Links:
> > 1: 
> > http://stackoverflow.com/questions/8499852/xmldocument-mis-reads-utf-8-e-acute-character
> > 
> > On Tue, Jul 16, 2013 at 04:24:49PM -0700, varun kumar gullipalli wrote:
> > > Here is the output Jarcec...
> > >  
> > >  
> > > 
> > > 
> > > ________________________________
> > > From: Jarek Jarcec Cecho <[email protected]>
> > > To: [email protected]; varun kumar gullipalli 
> > > <[email protected]> 
> > > Sent: Tuesday, July 16, 2013 11:05 AM
> > > Subject: Re: Sqoop - utf-8 data load issue
> > > 
> > > 
> > > Thank you for the additional information Varun! Would you mind doing 
> > > something like the following:
> > > 
> > > hadoop dfs -text THE_FILE  | hexdump -C
> > > 
> > > And sharing the output? I'm trying to see the actual content of the file 
> > > rather than any interpreted value.
> > > 
> > > Jarcec
> > > 
> > > On Mon, Jul 15, 2013 at 06:52:11PM -0700, varun kumar gullipalli wrote:
> > > > Hi Jarcec,
> > > > 
> > > > I am validating the data by running the following command,
> > > > 
> > > > hadoop fs -text <hdfs cluster>
> > > > 
> > > > I think there is no issue with the shell (correct me if I am wrong) 
> > > > because I am connecting to the MySQL database from the same shell 
> > > > (command line) and could view the source data properly.
> > > > 
> > > > Initially we observed that the following conf files didn't declare 
> > > > utf-8 encoding: 
> > > > <?xml version="1.0" encoding="UTF-8"?>
> > > > 
> > > > sqoop-site.xml
> > > > sqoop-site-template.xml
> > > > 
> > > > But no luck after making the changes too.
> > > > 
> > > > Thanks,
> > > > Varun
> > > > 
> > > > 
> > > > ________________________________
> > > >  From: Jarek Jarcec Cecho <[email protected]>
> > > > To: [email protected]; varun kumar gullipalli 
> > > > <[email protected]> 
> > > > Sent: Monday, July 15, 2013 6:37 PM
> > > > Subject: Re: Sqoop - utf-8 data load issue
> > > >  
> > > > 
> > > > Hi Varun,
> > > > we are usually not seeing any issues with transferring text data in 
> > > > UTF. How are you validating the imported file? I can imagine that your 
> > > > shell might be messing up the encoding.
> > > > 
> > > > Jarcec
> > > > 
> > > > On Mon, Jul 15, 2013 at 06:27:25PM -0700, varun kumar gullipalli wrote:
> > > > > 
> > > > > 
> > > > > Hi,
> > > > > I am importing data from MySQL to HDFS using a free-form query import.
> > > > > It works fine, but I am facing an issue when the data is utf-8. The 
> > > > > source (MySQL) db is utf-8 compatible, but it looks like sqoop is 
> > > > > converting the data during import.
> > > > > Example - The source value elémeñt is loaded as elÃ©meÃ±t to HDFS.
> > > > > Please provide a solution for this.
> > > > > Thanks in advance!
> > > 
> > > 
> > > 00000000  31 32 33 34 35 36 37 38  39 30 07 31 33 37 33 32  
> > > |1234567890.13732|
> > > 00000010  36 30 33 34 36 31 35 31  07 31 33 37 33 32 36 30  
> > > |60346151.1373260|
> > > 00000020  33 34 36 31 35 31 07 30  07 65 6c c3 83 c2 a9 6d  
> > > |346151.0.el....m|
> > > 00000030  65 c3 83 c2 b1 74 07 c3  a8 c2 b4 c2 bc c3 a2 e2  
> > > |e....t..........|
> > > 00000040  80 9a c2 ac c3 ac e2 80  9a c2 ac c3 ac e2 80 93  
> > > |................|
> > > 00000050  c2 b4 c3 a8 e2 80 b0 c2  be c3 a8 c2 a5 c2 bf 0a  
> > > |................|
> > > 00000060
> > 
> > 
> > Here is a sample command line ....
> >   sqoop --options-file $CONN_FILE --lines-terminated-by '\n' --verbose 
> > --query "<<QUERY>>' and  \$CONDITIONS" -m 1 --target-dir 
> > $YYYY/$MM/$DD/${TBL_NAME} --null-string '\\N' --null-non-string '\\N' >> 
> > $LOGFILE 2>&1
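
For readers who want to check the diagnosis themselves, here is a minimal Python sketch (not part of the original thread) of the double encoding that the hexdump quoted above appears to show: UTF-8 bytes misread as Latin-1 and then encoded to UTF-8 a second time. The function names double_encode and repair are hypothetical, and the sketch assumes plain ISO-8859-1. It covers only the bytes of the "type" field (offsets 0x29-0x35 in the dump); the later field contains sequences such as e2 80 9a that suggest a Windows-1252 style mapping, which this sketch does not handle.

```python
# Bytes of the "type" field copied from the hexdump (offsets 0x29-0x35).
observed = bytes.fromhex("65 6c c3 83 c2 a9 6d 65 c3 83 c2 b1 74")


def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread those bytes as Latin-1, encode as UTF-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def repair(raw: bytes) -> str:
    """Undo one round of Latin-1 double encoding (assumes ISO-8859-1 only)."""
    return raw.decode("utf-8").encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    # Reproduces the bytes seen in HDFS from the correct source value.
    print(double_encode("elémeñt").hex(" "))  # 65 6c c3 83 c2 a9 6d 65 c3 83 c2 b1 74
    # Recovers the original value from the damaged bytes.
    print(repair(observed))  # elémeñt
```

That the round trip matches the dump byte for byte is consistent with the Latin-1 connection encoding Varun reported; it is an illustration of the symptom, not a replacement for fixing the connection character set.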


Reply via email to