Thanks, Sujit.

The main problem I'm having is with normalizing the wide range of unicode white space characters (e.g. U+0085, U+00A0...) to U+0020 before squeezing. The only thing I can find is the isWhitespace() function, which would require iterating over each of the characters in the string and testing/replacing them individually. I was wondering if there was a charset pattern that squeeze could take that would represent all unicode white space characters?
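
For illustration, here is a minimal sketch of one way the normalization could be done with a plain java.util.regex character class instead of a per-character loop. It is not a commons-lang feature: the class name WhitespaceNormalizer and the method normalize() are hypothetical, and the list of code points is an assumption based on the Unicode White_Space property, so it should be checked against the Unicode version that matters for the spec.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of commons-lang.
public class WhitespaceNormalizer {

    // Assumed set of Unicode White_Space code points, written out explicitly
    // so the pattern does not depend on JDK-specific property support.
    // The backslashes are doubled so that the regex engine, not the Java
    // compiler, interprets the \\uXXXX escapes.
    private static final Pattern UNICODE_WHITE_SPACE = Pattern.compile(
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]+");

    // Converts each run of one or more white space characters into a single
    // U+0020 SPACE. Returns null for null input, similar in spirit to the
    // null handling of the commons-lang utilities.
    public static String normalize(String input) {
        if (input == null) {
            return null;
        }
        return UNICODE_WHITE_SPACE.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "This\u0085\u00A0is\u2003a\t\n test";
        System.out.println("after=" + normalize(s)); // prints "after=This is a test"
    }
}
```

The explicit class is used here because Character.isWhitespace() returns false for the non-breaking spaces (U+00A0, U+2007, U+202F) and for U+0085, so a test based on it would miss two of the characters mentioned above.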

S

On 29 Oct 2009, at 18:26, Sujit Pal wrote:

Hi Scott,

I just use something like this:

    s = s.replaceAll("\\s+", " ");

or, since you are doing unicode:

    String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
    System.out.println("before=" + s);
    s = s.replaceAll("\u0200+", "\u0200");
    System.out.println("after=" + s);

Gives me this:

    before=ThisȀȀisȀaȀȀtest
    after=ThisȀisȀaȀtest

Of course, you lose the null checking that commons-lang gives you. Using CharSetUtils.squeeze() also gives me identical results...

    String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
    System.out.println("before=" + s);
    s = org.apache.commons.lang.CharSetUtils.squeeze(s, new String[] {"\u0200"});
    System.out.println("after=" + s);

Also changed your subject line to include [lang] per guidelines on this list.

-sujit

On Thu, 2009-10-29 at 16:21 +0000, Scott Wilson wrote:

Hi everyone,

I need to implement a W3C processing algorithm which states:

10.1.8 Rule for Getting Text Content with Normalized White Space

The rule for getting text content with normalized white space is given in the following algorithm. The algorithm always returns a string, which MAY be empty.

• Let input be the Element to be processed.
• Let result be the result of applying the rule for getting text content to input.
• In result, convert any sequence of one or more Unicode white space characters into a single U+0020 SPACE.
• Return result.

The step I'm having problems with is "convert any sequence of one or more Unicode white space characters into a single U+0020 SPACE."

The StringUtils replace() and CharSetUtils squeeze() methods would seem to be best suited for solving this one, but there doesn't seem to be a set syntax for easily specifying unicode white space chars defined for one thing.

Has anyone else solved a similar problem using commons lang, or should I consider using something else?

Thanks!

S

/-/-/-/-/-/
Scott Wilson
Apache Wookie: http://incubator.apache.org/projects/wookie.html
