Re: unicode characters aren't displaying correctly with data driven test

Jian Fang Tue, 17 Aug 2010 22:21:08 -0700

BTW, I used your data to create the following DDT module:

class UnicodeModule extends TelluriumDataDrivenModule{
  void defineModule() {
    typeHandler "unicode", "org.telluriumsource.ddt.UnicodeTypeHandler"


    fs.FieldSet(name: "record", description: "Data format for testing Unicode") {
        Test(value: "testUnicode")
        Field(name: "title", description: "test title")
        Field(name: "abstract", type: "unicode", description: "abstract")
        Field(name: "email", description: "email")
        Field(name: "indicator", type: "boolean", description: "indicator")
    }

    defineTest("testUnicode") {
      String title = bind("record.title")
      String abst = bind("record.abstract")
      String email = bind("record.email")
      boolean indicator = bind("record.indicator")
      println "$title, $abst, $email, $indicator"
    }

  }
}

where the "unicode" type handler is defined as follows.

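class UnicodeTypeHandler implements TypeHandler {

  // Returns null/blank input unchanged; otherwise decodes the escapes
  // with the parseUnicode utility method quoted further down.
  public String valueOf(String s) {
    if(s == null || s.trim().length() == 0){
      return s;
    }
    return parseUnicode(s);
  }
}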

You can find the whole test case in trunk/core.
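
If trunk/core is not at hand, the decoding step can be tried standalone. The
following is only a minimal sketch, not the code in trunk: it swaps the
StringTokenizer approach for a regular expression, and the class name
UnicodeEscapeSketch and the method decode are made up for illustration. It
assumes every escape has the form backslash, u, four hex digits, and it leaves
anything else untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeSketch {

    // Matches a literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces each textual escape with the character
    // it names. Null or empty input is returned unchanged.
    public static String decode(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        Matcher m = ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or a backslash in the result.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // In Java source, a doubled backslash before u yields the two
        // characters backslash and u, which is what the data file contains.
        String fromFile = "My abstract is ... \\u004E\\u00FC\\u0068\\u0065 &";
        String decoded = decode(fromFile);
        System.out.println(fromFile.length()); // 45
        System.out.println(decoded.length());  // 25
    }
}
```

A type handler's valueOf(String) could delegate to a method like this in place
of parseUnicode. One difference in behavior: a backslash followed by an
ordinary word starting with "u" is left as-is here, whereas the tokenizer
version would try to parse it as hex and throw a NumberFormatException.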

Let us know if you have further problems.

Thanks,

Jian

On Wed, Aug 18, 2010 at 1:18 AM, Jian Fang <[email protected]> wrote:

> I finally found some time to get back to this topic. I found the following
> utility method to parse Unicode escapes:
>
>   public static String parseUnicode(String input)
>   {
>       StringTokenizer st = new StringTokenizer(input, "\\", true);
>
>       StringBuffer sb = new StringBuffer();
>
>       while(st.hasMoreTokens())
>       {
>           String token = st.nextToken();
>           if (token.charAt(0) == '\\' && token.length() == 1)
>           {
>               if(st.hasMoreTokens())
>               {
>                   token = st.nextToken();
>               }
>               if(token.charAt(0) == 'u')
>               {
>                   String hexnum;
>                   if (token.length() > 5)
>                   {
>                       hexnum = token.substring(1,5);
>                       token = token.substring(5);
>                   }
>                   else
>                   {
>                       hexnum = token.substring(1);
>                       token = "";
>                   }
>                   sb.append((char)Integer.parseInt(hexnum, 16));
>               }
>           }
>           sb.append(token);
>       }
>       return sb.toString();
>
>   }
>
>
> On Fri, Aug 6, 2010 at 3:53 PM, Jian Fang <[email protected]> wrote:
>
>>
>>
>>
>> Not sure if this helps:
>>
>> http://www.jguru.com/faq/view.jsp?EID=137049
>>
>>
>> On Fri, Aug 6, 2010 at 3:49 PM, Jian Fang <[email protected]> wrote:
>>
>>> I see your problem here. In your input file, the Unicode escapes are
>>> plain text, so Java reads each one as six ordinary characters rather
>>> than as a single character. One thing you can do is to convert the
>>> escaped text back to real Unicode characters, then do the conversion
>>> to UTF-8.
>>>
>>> For example, you can have a state machine that looks for the start
>>> sequence "\u", i.e., the two characters "\" and "u"; the next four
>>> characters are then the hex digits of the character.
>>>
>>> There may be a better way to handle this. I need to do some googling.
>>>
>>> Thanks,
>>>
>>> Jian
>>>
>>>
>>> On Fri, Aug 6, 2010 at 12:57 PM, Jade <[email protected]> wrote:
>>>
>>>> Hi Jian,
>>>>
>>>> Thank you for all of the information. I tried implementing the
>>>> UnicodeTypeHandler as you mentioned. However, the new
>>>> String(test.getBytes(),"UTF-8"); call isn't working correctly because
>>>> the bytes at that point are already not correct.
>>>>
>>>> The input string s is:
>>>>
>>>> [\, Q, e, n, t, e, r, D, o, c, u, m, e, n, t, I, n, f, o, r, m, a, t,
>>>> i, o, n, |, T, e, s, t,  , T, i, t, l, e, |, M, y,  , a, b, s, t, r,
>>>> a, c, t,  , i, s,  , ., ., .,  , \, u, 0, 0, 4, E, \, u, 0, 0, F, C,
>>>> \, u, 0, 0, 6, 8, \, u, 0, 0, 6, 5,  , &, |, t, e, s, t, ;,  , V, i,
>>>> r, e, o, |, 3, |, n, o, -, r, e, p, l, y, @, t, d, l, ., o, r, g, |,
>>>> t, r, u, e, \, E]
>>>>
>>>> and the part of the string that represents the data is:
>>>>
>>>> My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>>
>>>> abstractText (in the data file): My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>>
>>>> String test = "My abstract is ... \u004E\u00FC\u0068\u0065 &";
>>>>
>>>> In the console:
>>>>
>>>> abstractText: My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>> test: My abstract is ... Nühe &
>>>>
>>>> I looked at the data in the debugger:
>>>>
>>>> In the debugger, the \u is double-escaped: \\u
>>>> c: My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>>
>>>> Each char is seen as a char, the \u was not correctly interpreted:
>>>>
>>>> thus c is 45 chars.
>>>> [M, y,  , a, b, s, t, r, a, c, t,  , i, s,  , ., ., .,  , \, u, 0, 0,
>>>> 4, E, \, u, 0, 0, F, C, \, u, 0, 0, 6, 8, \, u, 0, 0, 6, 5,  , &]
>>>>
>>>> cBytes: [77, 121, 32, 97, 98, 115, 116, 114, 97, 99, 116, 32, 105,
>>>> 115, 32, 46, 46, 46, 32, 92, 117, 48, 48, 52, 69, 92, 117, 48, 48, 70,
>>>> 67, 92, 117, 48, 48, 54, 56, 92, 117, 48, 48, 54, 53, 32, 38]
>>>>
>>>> d: (25 chars) [M, y,  , a, b, s, t, r, a, c, t,  , i,
>>>> s,  , ., ., .,  , N, ü, h, e,  , &]
>>>>
>>>> dBytes: [77, 121, 32, 97, 98, 115, 116, 114, 97, 99, 116, 32, 105,
>>>> 115, 32, 46, 46, 46, 32, 78, -61, -68, 104, 101, 32, 38]
>>>>
>>>>
>>>> Jade
>>>>
>>>> On Aug 5, 4:18 pm, Jian Fang <[email protected]> wrote:
>>>> > To save you time, here is an example type handler:
>>>> >
>>>> >
>>>> ----------------------------------------------------------------------------------------------------
>>>> >
>>>> > package org.telluriumsource.ut
>>>> >
>>>> > import org.telluriumsource.test.ddt.mapping.type.TypeHandler
>>>> >
>>>> > class PhoneNumberTypeHandler implements TypeHandler{
>>>> >     protected final static String PHONE_SEPARATOR = "-"
>>>> >     protected final static int PHONE_LENGTH = 12
>>>> >
>>>> >     //remove the "-" inside the phone number
>>>> >     public String valueOf(String s) {
>>>> >         String value
>>>> >
>>>> >         if(s != null && (s.length() > 0)){
>>>> >              value = s.replaceAll(PHONE_SEPARATOR, "")
>>>> >         }else {
>>>> >             value = s
>>>> >         }
>>>> >
>>>> >         return value
>>>> >     }
>>>> >
>>>> > }
>>>> > On Thu, Aug 5, 2010 at 5:16 PM, Jian Fang <[email protected]> wrote:
>>>> > > Seems the following code could convert the Unicode to a UTF-8 string.
>>>> >
>>>> > >    @Test
>>>> > >     public void testUicode(){
>>>> > >         String test =
>>>> > >             "\u004E\u00FC\u0068\u0065\u00F0\u0061\u006E\u0020\u03AC\u03C1\u03C7";
>>>> > >         try {
>>>> > >             String c = new String(test.getBytes(),"UTF-8");
>>>> > >             System.out.println("Converted: " + c);
>>>> > >         } catch (UnsupportedEncodingException e) {
>>>> > >             e.printStackTrace();
>>>> > >         }
>>>> > >     }
>>>> >
>>>> > > For your data driven test, you need to create a custom type handler.
>>>> > > Please see the example here:
>>>> >
>>>> > > http://code.google.com/p/aost/wiki/UserGuide070DetailsOnTellurium#typ...
>>>> >
>>>> > > Thanks,
>>>> >
>>>> > > Jian
>>>> >
>>>> > > On Thu, Aug 5, 2010 at 4:59 PM, Jian Fang <[email protected]> wrote:
>>>> >
>>>> > >> It seems you need to create a custom type handler to convert the
>>>> > >> Unicode to "UTF8" format. I will try to find some time to see if I
>>>> > >> can create some test code for you.
>>>> > >> Sorry about that, I am busy with Trump now.
>>>> >
>>>> > >> Thanks,
>>>> >
>>>> > >> Jian
>>>> >
>>>> > >> On Thu, Aug 5, 2010 at 4:11 PM, Jade <[email protected]> wrote:
>>>> >
>>>> > >>> Hi,
>>>> >
>>>> > >>> Some of our test data includes unicode characters, such as:
>>>> >
>>>> > >>> enterDocumentInformation|Test Title|My abstract is ... \u004E\u00FC\u0068\u0065 &|test; Vireo|3|[email protected]|true
>>>> >
>>>> > >>> However, the Unicode escapes aren't being decoded as they're read
>>>> > >>> in and bound to the variable.
>>>> >
>>>> > >>> String abstractText = bind("DocumentInformationData.abstract")
>>>> >
>>>> > >>> println "abstractText: ${abstractText}"
>>>> >
>>>> > >>> String test = "\u004E\u00FC\u0068\u0065\u00F0\u0061\u006E\u0020\u03AC\u03C1\u03C7"
>>>> > >>> println "test: ${test}"
>>>> >
>>>> > >>> abstractText: My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>> > >>> test: Nüheðan άρχ
>>>> >
>>>> > >>> Is there another method that I need to call to decode the Unicode
>>>> > >>> escapes?
>>>> >
>>>> > >>> Jade
>>>> >
>>>> > >>> --
>>>> > >>> You received this message because you are subscribed to the Google
>>>> > >>> Groups "tellurium-users" group.
>>>> > >>> To post to this group, send email to [email protected].
>>>> > >>> To unsubscribe from this group, send email to
>>>> > >>> [email protected].
>>>> > >>> For more options, visit this group at
>>>> > >>> http://groups.google.com/group/tellurium-users?hl=en.
>>>>
>>>
>>>
>>
>>
>

