Re: unicode characters aren't displaying correctly with data driven test

Jade Thu, 23 Sep 2010 12:52:37 -0700

Thank you so much Jian! I got busy with the tests, and I missed your
post. This is just the fix I needed.


Jade

On Aug 18, 12:21 am, Jian Fang <[email protected]> wrote:
> BTW, I used your data to create the following DDT module:
>
> class UnicodeModule extends TelluriumDataDrivenModule{
>   void defineModule() {
>     typeHandler "unicode", "org.telluriumsource.ddt.UnicodeTypeHandler"
>
>     fs.FieldSet(name: "record", description: "Data format for testing Unicode") {
>         Test(value: "testUnicode")
>         Field(name: "title", description: "test title")
>         Field(name: "abstract", type: "unicode", description: "abstract")
>         Field(name: "email", description: "email")
>         Field(name: "indicator", type: "boolean", description: "indicator")
>     }
>
>     defineTest("testUnicode") {
>       String title = bind("record.title")
>       String abst = bind("record.abstract")
>       String email = bind("record.email")
>       boolean indicator = bind("record.indicator")
>       println "$title, $abst, $email, $indicator"
>     }
>
>   }
>
> }
>
> where the "unicode" type handler is defined as follows.
>
> class UnicodeTypeHandler implements TypeHandler {
>
>   public String valueOf(String s) {
>     if(s == null || s.trim().length() == 0){
>       return s;
>     }
>     return parseUnicode(s);
>   }
>
> }
>
> You can find the whole test case from trunk/core.
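
For anyone reading this in the archive without the trunk checkout, here is a minimal, self-contained sketch of how such a handler could look. The `TypeHandler` interface below is only a local stand-in with the single `valueOf(String)` method shown in this thread, not the real Tellurium type, and the class name `UnicodeTypeHandlerSketch` and the regex-based decoding are my own assumptions, not the code in trunk/core.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local stand-in for Tellurium's TypeHandler (assumption: one valueOf(String) method).
interface TypeHandler {
    String valueOf(String s);
}

// Hypothetical handler that decodes backslash-u escapes stored as plain text.
class UnicodeTypeHandlerSketch implements TypeHandler {
    // A literal backslash, the letter 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    @Override
    public String valueOf(String s) {
        if (s == null || s.trim().isEmpty()) {
            return s;
        }
        Matcher m = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslash in the decoded text literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Because only a backslash followed by `u` and four hex digits is matched, any other backslash in the field passes through unchanged.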
>
> Let us know if you have further problems.
>
> Thanks,
>
> Jian
>
>
>
> On Wed, Aug 18, 2010 at 1:18 AM, Jian Fang <[email protected]> wrote:
> > I finally found some time to get back to this topic. I found the following
> > utility method to parse the Unicode escapes:
>
> >   public static String parseUnicode(String input)
> >   {
> >       StringTokenizer st = new StringTokenizer(input, "\\", true);
>
> >       StringBuffer sb = new StringBuffer();
>
> >       while(st.hasMoreTokens())
> >       {
> >           String token = st.nextToken();
> >           if (token.charAt(0) == '\\' && token.length() == 1)
> >           {
> >               if(st.hasMoreTokens())
> >               {
> >                   token = st.nextToken();
> >               }
> >               if(token.charAt(0) == 'u')
> >               {
> >                   String hexnum;
> >                   if (token.length() > 5)
> >                   {
> >                       hexnum = token.substring(1,5);
> >                       token = token.substring(5);
> >                   }
> >                   else
> >                   {
> >                       hexnum = token.substring(1);
> >                       token = "";
> >                   }
> >                   sb.append((char)Integer.parseInt(hexnum, 16));
> >               }
> >           }
> >           sb.append(token);
> >       }
> >       return sb.toString();
>
> >   }
>
> > On Fri, Aug 6, 2010 at 3:53 PM, Jian Fang <[email protected]> wrote:
>
> >> Not sure if this helps:
>
> >> http://www.jguru.com/faq/view.jsp?EID=137049
>
> >> On Fri, Aug 6, 2010 at 3:49 PM, Jian Fang <[email protected]> wrote:
>
> >>> I see your problem here: in your input file, the Unicode escapes are stored
> >>> as plain text, so the Java String treats them as ordinary characters too.
> >>> One thing you can do is convert the escaped text back to real Unicode
> >>> characters, then do the conversion to UTF-8.
>
> >>> For example, you can have a state machine that watches for the start
> >>> sequence "\u", i.e., the two characters "\" and "u"; then you know that
> >>> the next four characters are the hex digits of a Unicode character.
>
> >>> There may be a better way to handle this. I need to do some googling.
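
The state machine idea above could be sketched roughly as follows. This is only an illustration under my own assumptions: the class name `UnicodeEscapeStateMachine`, the three states, and the choice to copy malformed escapes through unchanged are all hypothetical, not something from Tellurium.

```java
// Hypothetical sketch: decode backslash-u escapes with an explicit state machine.
class UnicodeEscapeStateMachine {
    private enum State { TEXT, BACKSLASH, HEX }

    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        StringBuilder hex = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '\\') {
                        state = State.BACKSLASH;
                    } else {
                        out.append(c);
                    }
                    break;
                case BACKSLASH:
                    if (c == 'u') {
                        hex.setLength(0);
                        state = State.HEX;
                    } else if (c == '\\') {
                        // Previous backslash was literal; this one may still start an escape.
                        out.append('\\');
                    } else {
                        out.append('\\').append(c);
                        state = State.TEXT;
                    }
                    break;
                case HEX:
                    if (Character.digit(c, 16) >= 0) {
                        hex.append(c);
                        if (hex.length() == 4) {
                            out.append((char) Integer.parseInt(hex.toString(), 16));
                            state = State.TEXT;
                        }
                    } else {
                        // Malformed escape: emit what was consumed, then reprocess c as text.
                        out.append("\\u").append(hex);
                        state = State.TEXT;
                        i--;
                    }
                    break;
            }
        }
        // Flush anything left over when the input ends in the middle of an escape.
        if (state == State.BACKSLASH) {
            out.append('\\');
        } else if (state == State.HEX) {
            out.append("\\u").append(hex);
        }
        return out.toString();
    }
}
```

With this sketch, a field holding the 24 characters of the four escapes from the data file would come back as the 4 characters "Nühe", and a backslash that does not start a complete escape is left in place.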
>
> >>> Thanks,
>
> >>> Jian
>
> >>> On Fri, Aug 6, 2010 at 12:57 PM, Jade <[email protected]> wrote:
>
> >>>> Hi Jian,
>
> >>>> Thank you for all of the information. I tried implementing the
> >>>> UnicodeTypeHandler as you mentioned. However, the new
> >>>> String(test.getBytes(),"UTF-8"); call isn't working correctly because
> >>>> the bytes at that point are already not correct.
>
> >>>> The input string s is:
>
> >>>> [\, Q, e, n, t, e, r, D, o, c, u, m, e, n, t, I, n, f, o, r, m, a, t,
> >>>> i, o, n, |, T, e, s, t,  , T, i, t, l, e, |, M, y,  , a, b, s, t, r,
> >>>> a, c, t,  , i, s,  , ., ., .,  , \, u, 0, 0, 4, E, \, u, 0, 0, F, C,
> >>>> \, u, 0, 0, 6, 8, \, u, 0, 0, 6, 5,  , &, |, t, e, s, t, ;,  , V, i,
> >>>> r, e, o, |, 3, |, n, o, -, r, e, p, l, y, @, t, d, l, ., o, r, g, |,
> >>>> t, r, u, e, \, E]
>
> >>>> and the part of the string that represents the data is:
>
> >>>> My abstract is ... \u004E\u00FC\u0068\u0065 &
>
> >>>> abstractText (in the data file): My abstract is ... \u004E\u00FC\u0068\u0065 &
>
> >>>> String test = "My abstract is ... \u004E\u00FC\u0068\u0065 &";
>
> >>>> In the console:
>
> >>>> abstractText: My abstract is ... \u004E\u00FC\u0068\u0065 &
> >>>> test: My abstract is ... Nühe &
>
> >>>> I looked at the data in the debugger:
>
> >>>> In the debugger, the \u is double-escaped: \\u
> >>>> c: My abstract is ... \u004E\u00FC\u0068\u0065 &
>
> >>>> Each character is seen as a separate char; the \u was not interpreted
> >>>> as an escape, thus c is 45 chars.
> >>>> [M, y,  , a, b, s, t, r, a, c, t,  , i, s,  , ., ., .,  , \, u, 0, 0,
> >>>> 4, E, \, u, 0, 0, F, C, \, u, 0, 0, 6, 8, \, u, 0, 0, 6, 5,  , &]
>
> >>>> cBytes: [77, 121, 32, 97, 98, 115, 116, 114, 97, 99, 116, 32, 105,
> >>>> 115, 32, 46, 46, 46, 32, 92, 117, 48, 48, 52, 69, 92, 117, 48, 48, 70,
> >>>> 67, 92, 117, 48, 48, 54, 56, 92, 117, 48, 48, 54, 53, 32, 38]
>
> >>>> d: (25 chars) [M, y,  , a, b, s, t, r, a, c, t,  , i,
> >>>> s,  , ., ., .,  , N, ü, h, e,  , &]
>
> >>>> dBytes: [77, 121, 32, 97, 98, 115, 116, 114, 97, 99, 116, 32, 105,
> >>>> 115, 32, 46, 46, 46, 32, 78, -61, -68, 104, 101, 32, 38]
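
The difference between the two strings can be reproduced in isolation with a small sketch like the one below. The class and field names are made up for illustration, and UTF-8 is passed explicitly here instead of relying on the platform default that `getBytes()` without arguments uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: why re-decoding the bytes as UTF-8 does not turn escapes into characters.
class EscapedTextDemo {
    // Text as read from the data file: each escape is six ordinary ASCII characters.
    static final String FROM_FILE = "My abstract is ... \\u004E\\u00FC\\u0068\\u0065 &";
    // Same text as a Java literal: the compiler turns each escape into one char.
    static final String LITERAL = "My abstract is ... \u004E\u00FC\u0068\u0065 &";

    // Encode to UTF-8 bytes and decode again; this never interprets escape text.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(FROM_FILE.length());                                // 45
        System.out.println(LITERAL.length());                                  // 25
        System.out.println(roundTrip(FROM_FILE).equals(FROM_FILE));            // true
        System.out.println(LITERAL.getBytes(StandardCharsets.UTF_8).length);   // 26
    }
}
```

The round trip hands back the same 45 characters because the backslash, the "u" and the hex digits are all plain ASCII; only a step that parses the escape text itself can produce the 25-character string.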
>
> >>>> Jade
>
> >>>> On Aug 5, 4:18 pm, Jian Fang <[email protected]> wrote:
> >>>> > To save your time, I post an example type handler here:
>
> >>>> ---------------------------------------------------------------------------
>
> >>>> > package org.telluriumsource.ut
>
> >>>> > import org.telluriumsource.test.ddt.mapping.type.TypeHandler
>
> >>>> > class PhoneNumberTypeHandler implements TypeHandler{
> >>>> >     protected final static String PHONE_SEPARATOR = "-"
> >>>> >     protected final static int PHONE_LENGTH = 12
>
> >>>> >     //remove the "-" inside the phone number
> >>>> >     public String valueOf(String s) {
> >>>> >         String value
>
> >>>> >         if(s != null && (s.length() > 0)){
> >>>> >              value = s.replaceAll(PHONE_SEPARATOR, "")
> >>>> >         }else {
> >>>> >             value = s
> >>>> >         }
>
> >>>> >         return value
> >>>> >     }
>
> >>>> > }
> >>>> > On Thu, Aug 5, 2010 at 5:16 PM, Jian Fang <[email protected]> wrote:
> >>>> > > Seems the following code could convert the Unicode to a UTF-8 string.
>
> >>>> > >     @Test
> >>>> > >     public void testUicode(){
> >>>> > >         String test = "\u004E\u00FC\u0068\u0065\u00F0\u0061\u006E\u0020\u03AC\u03C1\u03C7";
> >>>> > >         try {
> >>>> > >             String c = new String(test.getBytes(),"UTF-8");
> >>>> > >             System.out.println("Converted: " + c);
> >>>> > >         } catch (UnsupportedEncodingException e) {
> >>>> > >             e.printStackTrace();
> >>>> > >         }
> >>>> > >     }
>
> >>>> > > For your data driven test, you need to create a custom type handler.
> >>>> > > Please see the example here:
>
> >>>>http://code.google.com/p/aost/wiki/UserGuide070DetailsOnTellurium#typ.
> >>>> ..
>
> >>>> > > Thanks,
>
> >>>> > > Jian
>
> >>>> > > On Thu, Aug 5, 2010 at 4:59 PM, Jian Fang <[email protected]> wrote:
>
> >>>> > >> Seems you need to create a custom handler to convert the Unicode to
> >>>> > >> "UTF-8" format. I will try to find some time to see if I can create
> >>>> > >> some test code for you.
> >>>> > >> Sorry for that, I am busy with Trump now.
>
> >>>> > >> Thanks,
>
> >>>> > >> Jian
>
> >>>> > >> On Thu, Aug 5, 2010 at 4:11 PM, Jade <[email protected]> wrote:
>
> >>>> > >>> Hi,
>
> >>>> > >>> Some of our test data includes unicode characters, such as:
>
> >>>> > >>> enterDocumentInformation|Test Title|My abstract is ... \u004E\u00FC\u0068\u0065 &|test; Vireo|3|[email protected]|true
>
> >>>> > >>> However, the unicode characters aren't being decoded as they're
> >>>> > >>> read in and bound to the variable.
>
> >>>> > >>> String abstractText = bind("DocumentInformationData.abstract")
>
> >>>> > >>> println "abstractText: ${abstractText}"
>
> >>>> > >>> String test = "\u004E\u00FC\u0068\u0065\u00F0\u0061\u006E\u0020\u03AC\u03C1\u03C7"
> >>>> > >>> println "test: ${test}"
>
> >>>> > >>> abstractText: My abstract is ... \u004E\u00FC\u0068\u0065 &
> >>>> > >>> test: Nüheðan άρχ
>
> >>>> > >>> Is there another method that I need to call to decode the unicode?
>
> >>>> > >>> Jade
>

-- 
You received this message because you are subscribed to the Google Groups 
"tellurium-users" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/tellurium-users?hl=en.
