Re: Truncating UTF-8 Strings (Resolved)

Klaus Berkling Wed, 10 Jun 2009 08:43:18 -0700


On Jun 8, 2009, at 2:23 PM, Klaus Berkling wrote:

Hi all.  This seems like it should work, but it doesn't.
I truncate a string that may contain Japanese characters, purely for display purposes. Double-byte or multi-byte characters are split apart.
Results look like this:
お使いのコンピュータにDVDドライブが搭載れているかは�?
[...]
Here is the code:
public String stringWithNoHTML(String aStringWithHTML, int lengthTruncated) {
        String returnValue = null;
        if (aStringWithHTML != null && aStringWithHTML.length() > 0) {

                //StringBuffer textBlock = new StringBuffer(aStringWithHTML);
                StringBuffer textBlock = new StringBuffer();
                Pattern htmlTagPattern = Pattern.compile("<(.|\n|\r)+?>|&[a-zA-Z0-9]+;");
                Matcher lineBreakMatcher = htmlTagPattern.matcher(aStringWithHTML);

                boolean results = lineBreakMatcher.find();
                while (results)
                {
                        lineBreakMatcher.appendReplacement(textBlock, " ");
                        results = lineBreakMatcher.find();
                }
                lineBreakMatcher.appendTail(textBlock);

                if (lengthTruncated > 0 && textBlock.length() > SUMMARY_LENGTH) {
                        try {
                                returnValue = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");
                        } catch (UnsupportedEncodingException ex) {
                                returnValue = null;
                        }
                        //returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");
                } else
                        returnValue = textBlock.toString();
        }
        return returnValue;
}
The original string may contain single-byte characters as well. I expect the string to be truncated properly, without chopping bytes off individual characters. It works fine with single-byte characters.
Using
returnValue = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");
or
returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");
makes no difference.
I also bypassed the regex pattern and still see the same problem.

Files, components, class, etc. are in UTF-8.
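
A minimal sketch of why the byte-based call can produce the replacement character shown above: in UTF-8 each of these Japanese characters takes three bytes, so cutting the byte array at an arbitrary offset can land inside a character. The class and method names here are hypothetical and not part of the original code, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class ByteTruncationDemo {
    // Hypothetical helper: keeps only the first byteCount bytes of the UTF-8 encoding.
    static String truncateBytes(String s, int byteCount) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(byteCount, utf8.length);
        // Decoding replaces an incomplete trailing byte sequence with U+FFFD.
        return new String(utf8, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Four Japanese characters written as escapes; each needs 3 bytes in UTF-8.
        String text = "\u304A\u4F7F\u3044\u306E";
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);   // 12
        // Cutting on a character boundary (3 bytes) keeps the first character intact.
        System.out.println(truncateBytes(text, 3).equals("\u304A"));        // true
        // Cutting after 4 bytes lands inside the second character.
        System.out.println(truncateBytes(text, 4).indexOf('\uFFFD') >= 0);  // true
    }
}
```

The point is only that lengthTruncated is treated as a byte count in the getBytes variant, while the display length is meant in characters.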


(For the archive)

After a chat with the Java people at WWDC, this code seems to truncate properly:


int correctLengthTrucated = lengthTruncated;
// Walk back from the requested length until the character at that index is a letter.
while (correctLengthTrucated > 0) {
        //if ( Character.isWhitespace(textBlock.charAt(correctLengthTrucated)) )
        if ( Character.isLetter(textBlock.charAt(correctLengthTrucated)) )
                break;
        else
                correctLengthTrucated--;
}

returnValue = new String(textBlock.substring(0, correctLengthTrucated) + "...");
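
A related sketch, for comparison only and not what was suggested at WWDC: the snippet above works on char indexes, so an alternative is to count Unicode code points and let String.offsetByCodePoints find the cut position, which avoids splitting a surrogate pair. The class and method names are hypothetical.

```java
public class CodePointTruncate {
    // Hypothetical helper: keeps at most maxCodePoints code points and
    // appends "..." only when something was actually cut off.
    static String truncate(String text, int maxCodePoints) {
        if (text == null || maxCodePoints <= 0) {
            return text; // nothing sensible to truncate to; return input unchanged
        }
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text; // already short enough
        }
        // Convert the code point count into a char index that never falls
        // between the two halves of a surrogate pair.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end) + "...";
    }

    public static void main(String[] args) {
        // Four Japanese characters, written as escapes.
        System.out.println(truncate("\u304A\u4F7F\u3044\u306E", 2).length()); // 5: two characters plus "..."
        // 'a', one supplementary character (a surrogate pair), 'b'.
        System.out.println(truncate("a\uD83D\uDE00b", 2).length());           // 6: 'a', the pair, "..."
    }
}
```

This only guards against splitting code points. It does not try to stop on a letter or word boundary the way the loop above does.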



Thanks to all who helped.

kib

"Success is not final, failure is not fatal: it is the courage to continue that counts."

Winston Churchill

Klaus Berkling
Systems Administrator
DynEd International, Inc.
www.dyned.com | www.eskimo.com/~kiberkli
