Re: [lang] collapsing unicode white space

Scott Wilson Fri, 30 Oct 2009 11:20:34 -0700

Thanks, Sujit.

The main problem I'm having is normalizing the wide range of Unicode white space characters (e.g. U+0085, U+00A0...) to U+0020 before squeezing. The only thing I can find is the isWhitespace() function, which would require iterating over each character in the string and testing/replacing it individually. Is there a charset pattern that squeeze() could take that would represent all Unicode white space characters?


S
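One way the question above could be approached without commons-lang is a regular expression over the Unicode White_Space property. The following is only a minimal sketch under stated assumptions: it relies on the `\p{IsWhite_Space}` binary property, which java.util.regex supports from Java 7 onward (so it postdates this thread), and the class and method names (`WhitespaceNormalizer`, `normalize`) are hypothetical, chosen purely for illustration.

```java
import java.util.regex.Pattern;

/** Sketch: collapse runs of Unicode white space into a single U+0020 SPACE. */
final class WhitespaceNormalizer {

    // \p{IsWhite_Space} is the Unicode White_Space binary property
    // (assumption: Java 7 or later). It matches U+0085 and U+00A0
    // as well as the ASCII white space characters.
    private static final Pattern UNICODE_WHITE_SPACE =
            Pattern.compile("\\p{IsWhite_Space}+");

    private WhitespaceNormalizer() {}

    /** Null-safe in the spirit of commons-lang: null in, null out. */
    static String normalize(String input) {
        if (input == null) {
            return null;
        }
        // Each run of one or more white space characters becomes one space.
        return UNICODE_WHITE_SPACE.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "This\u0085\u00A0is \t a\u3000\u3000test";
        System.out.println(normalize(s)); // prints "This is a test"
    }
}
```

Note that the default `\s` class in java.util.regex only covers ASCII white space, and `Character.isWhitespace()` deliberately excludes non-breaking spaces such as U+00A0, so neither is a drop-in match for "Unicode white space" as the W3C rule words it.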

On 29 Oct 2009, at 18:26, Sujit Pal wrote:

Hi Scott,

I just use something like this:

s = s.replaceAll("\\s+", " ");

or since you are doing unicode:

String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
System.out.println("before=" + s);
s = s.replaceAll("\u0200+", "\u0200");
System.out.println("after=" + s);

Gives me this:
before=ThisȀȀisȀaȀȀtest
after=ThisȀisȀaȀtest

Of course, you lose the null checking that commons-lang gives you. Using
CharSetUtils.squeeze() also gives me identical results...

String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
System.out.println("before=" + s);
s = org.apache.commons.lang.CharSetUtils.squeeze(s, new String[] {"\u0200"});
System.out.println("after=" + s);

Also changed your subject line to include [lang] per guidelines on this
list.

-sujit

On Thu, 2009-10-29 at 16:21 +0000, Scott Wilson wrote:

Hi everyone,

I need to implement a W3C processing algorithm which states:

10.1.8 Rule for Getting Text Content with Normalized White Space

The rule for getting text content with normalized white space is given
in the following algorithm. The algorithm always returns a string,
which MAY be empty.

        • Let input be the Element to be processed.
        • Let result be the result of applying the rule for getting text
          content to input.
        • In result, convert any sequence of one or more Unicode white space
          characters into a single U+0020 SPACE.
        • Return result.

The step I'm having problems with is "convert any sequence of one or
more Unicode white space characters into a single U+0020 SPACE."

The StringUtils replace() and CharSetUtils squeeze() methods would
seem to be best suited for solving this one, but for one thing there
doesn't seem to be a set syntax defined for easily specifying Unicode
white space chars.

Has anyone else solved a similar problem using commons lang, or should
I consider using something else?

Thanks!

S


/-/-/-/-/-/
Scott Wilson
Apache Wookie: http://incubator.apache.org/projects/wookie.html



