I finally found some time to get back to this topic. I found the following
utility method for parsing Unicode escapes:
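import java.util.StringTokenizer;

// Replaces each literal backslash-u escape (backslash, 'u', four hex digits)
// with the character it encodes. Throws NumberFormatException if the four
// characters after the 'u' are not hex digits.
public static String parseUnicode(String input)
{
    // Split on backslashes and return the backslashes themselves as tokens.
    StringTokenizer st = new StringTokenizer(input, "\\", true);
    StringBuilder sb = new StringBuilder();
    while (st.hasMoreTokens())
    {
        String token = st.nextToken();
        if (token.equals("\\") && st.hasMoreTokens())
        {
            token = st.nextToken();
            if (token.charAt(0) == 'u' && token.length() >= 5)
            {
                // The four characters after the 'u' are the hex value.
                String hexnum = token.substring(1, 5);
                token = token.substring(5);
                sb.append((char) Integer.parseInt(hexnum, 16));
            }
            else
            {
                // Not a Unicode escape: keep the backslash instead of dropping it.
                sb.append('\\');
            }
        }
        sb.append(token);
    }
    return sb.toString();
}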
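
In case it helps, here is a minimal sketch of how the same decoding might be wired into a type handler for the data driven test. The TypeHandler interface declared below is only a stand-in, modeled on the valueOf(String) method in the PhoneNumberTypeHandler example quoted further down; the real one lives in org.telluriumsource.test.ddt.mapping.type. The class name UnicodeTypeHandler and the regex-based approach are my assumptions, not something Tellurium provides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for Tellurium's TypeHandler interface (assumption: it only
// needs valueOf(String), as in the PhoneNumberTypeHandler example).
interface TypeHandler {
    String valueOf(String s);
}

// Hypothetical handler that decodes literal backslash-u escapes in test data.
class UnicodeTypeHandler implements TypeHandler {
    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    @Override
    public String valueOf(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        Matcher m = ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against a decoded '$' or backslash.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Anything that does not look like a complete escape (for example a Windows path such as C:\users) is left untouched, which seemed the safer choice for test data.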
On Fri, Aug 6, 2010 at 3:53 PM, Jian Fang <[email protected]> wrote:
>
>
>
> Not sure if this helps:
>
> http://www.jguru.com/faq/view.jsp?EID=137049
>
>
> On Fri, Aug 6, 2010 at 3:49 PM, Jian Fang <[email protected]> wrote:
>
>> I see your problem. In your input file, the Unicode escapes are stored as
>> plain text, so Java reads each one as six ordinary characters instead of
>> one character. One thing you can do is convert the escaped text back to
>> the characters it represents, then do the conversion to UTF-8.
>>
>> For example, you can write a state machine that watches for the start
>> sequence "\u", i.e., the two characters "\" and "u". Once it sees them, it
>> knows the next four characters are the hex digits of a Unicode code unit.
>>
>> There may be a better way to handle this. I need to do some googling.
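
The state machine idea above could be sketched roughly as follows. This is only an illustration under my own assumptions: the class name, the three states, and the choice to pass malformed escapes through unchanged are all hypothetical.

```java
// Hypothetical state machine that decodes literal backslash-u escapes.
class EscapeStateMachine {
    private enum State { TEXT, BACKSLASH, HEX }

    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        StringBuilder hex = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            switch (state) {
                case TEXT:
                    if (ch == '\\') {
                        state = State.BACKSLASH;   // possible start of an escape
                    } else {
                        out.append(ch);
                    }
                    break;
                case BACKSLASH:
                    if (ch == 'u') {
                        hex.setLength(0);
                        state = State.HEX;         // start collecting hex digits
                    } else if (ch == '\\') {
                        out.append('\\');          // emit the earlier backslash, keep waiting
                    } else {
                        out.append('\\').append(ch);
                        state = State.TEXT;
                    }
                    break;
                case HEX:
                    if (Character.digit(ch, 16) >= 0) {
                        hex.append(ch);
                        if (hex.length() == 4) {
                            out.append((char) Integer.parseInt(hex.toString(), 16));
                            state = State.TEXT;
                        }
                    } else {
                        // Malformed escape: emit what was read so far, unchanged.
                        out.append("\\u").append(hex);
                        if (ch == '\\') {
                            state = State.BACKSLASH;
                        } else {
                            out.append(ch);
                            state = State.TEXT;
                        }
                    }
                    break;
            }
        }
        // Flush anything left over from an unfinished escape at the end.
        if (state == State.BACKSLASH) {
            out.append('\\');
        } else if (state == State.HEX) {
            out.append("\\u").append(hex);
        }
        return out.toString();
    }
}
```

Calling EscapeStateMachine.decode on the abstract text from the data file should turn the four escapes into the four characters they stand for and leave the rest of the line alone.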
>>
>> Thanks,
>>
>> Jian
>>
>>
>> On Fri, Aug 6, 2010 at 12:57 PM, Jade <[email protected]> wrote:
>>
>>> Hi Jian,
>>>
>>> Thank you for all of the information. I tried implementing the
>>> UnicodeTypeHandler as you mentioned. However, the new
>>> String(test.getBytes(),"UTF-8"); call isn't working correctly because
>>> the bytes at that point are already not correct.
>>>
>>> The input string s is:
>>>
>>> [\, Q, e, n, t, e, r, D, o, c, u, m, e, n, t, I, n, f, o, r, m, a, t,
>>> i, o, n, |, T, e, s, t, , T, i, t, l, e, |, M, y, , a, b, s, t, r,
>>> a, c, t, , i, s, , ., ., ., , \, u, 0, 0, 4, E, \, u, 0, 0, F, C,
>>> \, u, 0, 0, 6, 8, \, u, 0, 0, 6, 5, , &, |, t, e, s, t, ;, , V, i,
>>> r, e, o, |, 3, |, n, o, -, r, e, p, l, y, @, t, d, l, ., o, r, g, |,
>>> t, r, u, e, \, E]
>>>
>>> and the part of the string that represents the data is:
>>>
>>> My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>
>>> abstractText (in the data file): My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>
>>> String test = "My abstract is ... \u004E\u00FC\u0068\u0065 &";
>>>
>>> In the console:
>>>
>>> abstractText: My abstract is ... \u004E\u00FC\u0068\u0065 &
>>> test: My abstract is ... Nühe &
>>>
>>> I looked at the data in the debugger:
>>>
>>> In the debugger, the \u is double-escaped: \\u
>>> c: My abstract is ... \u004E\u00FC\u0068\u0065 &
>>>
>>> Each char is seen as a char, the \u was not correctly interpreted:
>>>
>>> thus c is 45 chars.
>>> [M, y, , a, b, s, t, r, a, c, t, , i, s, , ., ., ., , \, u, 0, 0,
>>> 4, E, \, u, 0, 0, F, C, \, u, 0, 0, 6, 8, \, u, 0, 0, 6, 5, , &]
>>>
>>> cBytes: [77, 121, 32, 97, 98, 115, 116, 114, 97, 99, 116, 32, 105,
>>> 115, 32, 46, 46, 46, 32, 92, 117, 48, 48, 52, 69, 92, 117, 48, 48, 70,
>>> 67, 92, 117, 48, 48, 54, 56, 92, 117, 48, 48, 54, 53, 32, 38]
>>>
>>> d: (25 chars) [M, y, , a, b, s, t, r, a, c, t, , i,
>>> s, , ., ., ., , N, ü, h, e, , &]
>>>
>>> dBytes: [77, 121, 32, 97, 98, 115, 116, 114, 97, 99, 116, 32, 105,
>>> 115, 32, 46, 46, 46, 32, 78, -61, -68, 104, 101, 32, 38]
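
The numbers above can be reproduced with a small check. It is a sketch, with a hypothetical class name, and it assumes Java 7 or later for StandardCharsets. It suggests why the getBytes/new String round trip cannot help: the escaped text is plain ASCII, so encoding and decoding it returns exactly the same 45 characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the escaped text with the decoded text.
class RoundTripDemo {
    // Encodes a string as UTF-8 bytes and decodes those bytes again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // What arrives from the data file: a literal backslash, 'u', four digits.
        String escaped = "My abstract is ... \\u004E\\u00FC\\u0068\\u0065 &";
        // What the compiler produces when the escapes appear in source code.
        String decoded = "My abstract is ... \u004E\u00FC\u0068\u0065 &";

        System.out.println(roundTrip(escaped).equals(escaped)); // prints true
        System.out.println(escaped.length() + " chars, " + utf8Length(escaped) + " bytes"); // 45 chars, 45 bytes
        System.out.println(decoded.length() + " chars, " + utf8Length(decoded) + " bytes"); // 25 chars, 26 bytes
    }
}
```

The one extra byte in the decoded case comes from the u-umlaut, which takes two bytes in UTF-8 (the -61, -68 pair in dBytes).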
>>>
>>>
>>> Jade
>>>
>>> On Aug 5, 4:18 pm, Jian Fang <[email protected]> wrote:
>>> > To save you time, here is an example type handler:
>>> >
>>> >
>>> ----------------------------------------------------------------------------------------------------
>>> >
>>> > package org.telluriumsource.ut
>>> >
>>> > import org.telluriumsource.test.ddt.mapping.type.TypeHandler
>>> >
>>> > class PhoneNumberTypeHandler implements TypeHandler{
>>> >     protected final static String PHONE_SEPARATOR = "-"
>>> >     protected final static int PHONE_LENGTH = 12
>>> >
>>> >     //remove the "-" inside the phone number
>>> >     public String valueOf(String s) {
>>> >         String value
>>> >
>>> >         if(s != null && (s.length() > 0)){
>>> >             value = s.replaceAll(PHONE_SEPARATOR, "")
>>> >         }else {
>>> >             value = s
>>> >         }
>>> >
>>> >         return value
>>> >     }
>>> >
>>> > }
>>> > On Thu, Aug 5, 2010 at 5:16 PM, Jian Fang <[email protected]> wrote:
>>> > > It seems the following code can convert the Unicode to a UTF-8 string.
>>> >
>>> > > @Test
>>> > > public void testUnicode(){
>>> > >     String test =
>>> > >         "\u004E\u00FC\u0068\u0065\u00F0\u0061\u006E\u0020\u03AC\u03C1\u03C7";
>>> > >     try {
>>> > >         // Note: getBytes() with no argument uses the platform default charset.
>>> > >         String c = new String(test.getBytes(),"UTF-8");
>>> > >         System.out.println("Converted: " + c);
>>> > >     } catch (UnsupportedEncodingException e) {
>>> > >         e.printStackTrace();
>>> > >     }
>>> > > }
>>> >
>>> > > For your data-driven test, you need to create a custom type handler.
>>> > > Please see the example here:
>>> >
>>> > > http://code.google.com/p/aost/wiki/UserGuide070DetailsOnTellurium#typ...
>>> >
>>> > > Thanks,
>>> >
>>> > > Jian
>>> >
>>> > > On Thu, Aug 5, 2010 at 4:59 PM, Jian Fang <[email protected]> wrote:
>>> >
>>> > >> It seems you need to create a custom type handler to convert the
>>> > >> Unicode to "UTF8" format. I will try to find some time to see if I can
>>> > >> create some test code for you.
>>> > >> Sorry about that; I am busy with Trump right now.
>>> >
>>> > >> Thanks,
>>> >
>>> > >> Jian
>>> >
>>> > >> On Thu, Aug 5, 2010 at 4:11 PM, Jade <[email protected]> wrote:
>>> >
>>> > >>> Hi,
>>> >
>>> > >>> Some of our test data includes unicode characters, such as:
>>> >
>>> > >>> enterDocumentInformation|Test Title|My abstract is ... \u004E\u00FC\u0068\u0065 &|test; Vireo|3|no-reply@tdl.org|true
>>> >
>>> > >>> However, the Unicode escapes aren't being decoded as they're read
>>> > >>> in and bound to the variable.
>>> >
>>> > >>> String abstractText = bind("DocumentInformationData.abstract")
>>> >
>>> > >>> println "abstractText: ${abstractText}"
>>> >
>>> > >>> String test = "\u004E\u00FC\u0068\u0065\u00F0\u0061\u006E\u0020\u03AC\u03C1\u03C7"
>>> > >>> println "test: ${test}"
>>> >
>>> > >>> abstractText: My abstract is ... \u004E\u00FC\u0068\u0065 &
>>> > >>> test: Nüheðan άρχ
>>> >
>>> > >>> Is there another method that I need to call to decode the Unicode
>>> > >>> escapes?
>>> >
>>> > >>> Jade
>>> >
>>> > >>> --
>>> > >>> You received this message because you are subscribed to the Google Groups
>>> > >>> "tellurium-users" group.
>>> > >>> To post to this group, send email to [email protected].
>>> > >>> To unsubscribe from this group, send email to
>>> > >>> [email protected].
>>> > >>> For more options, visit this group at
>>> > >>> http://groups.google.com/group/tellurium-users?hl=en.
>>>
>>
>>
>
>