Scott Wilson wrote:
Well, after a bit of research I finally found a solution to this problem. StringUtils and CharSetUtils play a role, but there was still a bit of a gap.

Here is the code:

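// Commons Lang 2.x package names assumed; in 3.x these live in org.apache.commons.lang3.
import org.apache.commons.lang.CharSetUtils;
import org.apache.commons.lang.StringUtils;

private static String normalize(String in, boolean includeWhitespace) {
    if (in == null) return "";
    StringBuilder out = new StringBuilder(in.length());
    for (int x = 0; x < in.length(); x++) {
        char ch = in.charAt(x);
        // Map Unicode space characters (and, optionally, whitespace such as
        // tabs and line breaks) to a plain ASCII space.
        if (Character.isSpaceChar(ch) || (includeWhitespace && Character.isWhitespace(ch))) {
            out.append(' ');
        } else {
            out.append(ch);
        }
    }
    // Collapse runs of spaces, then strip leading and trailing whitespace.
    return StringUtils.strip(CharSetUtils.squeeze(out.toString(), " "));
}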

Interestingly enough, there is no "normalize Unicode whitespace/space chars" method in any of the libs that I tested (e.g. jdom, dom4j).

Surely a simple regex does it?

Sujit posted:
s = s.replaceAll("\\s+", " ");

or, since you are dealing with Unicode:

// U+2000 (EN QUAD) is a Unicode space character; U+0200 is the letter "Ȁ", not a space.
String s = "This\u2000\u2000is\u2000a\u2000\u2000test";
System.out.println("before=" + s);
s = s.replaceAll("\u2000+", "\u2000");
System.out.println("after=" + s);
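
One caveat worth checking before relying on "\\s+": by default (without the UNICODE_CHARACTER_CLASS flag), \s in java.util.regex matches only the ASCII whitespace characters [ \t\n\x0B\f\r]. A small sketch of the difference follows; the class and method names are made up for illustration.

```java
class AsciiWhitespaceDemo {
    // Collapses runs matched by \s into a single space.
    // By default \s covers ASCII whitespace only.
    static String collapseAscii(String in) {
        return in.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String ascii = "This \t is\n\na test";
        String unicode = "This\u2003\u2003is a test"; // U+2003 EM SPACE

        // ASCII runs are collapsed.
        System.out.println(collapseAscii(ascii));
        // The EM SPACE run is not matched by \s, so the string comes back unchanged.
        System.out.println(collapseAscii(unicode).equals(unicode));
    }
}
```

So the first one-liner is fine for ASCII input but leaves Unicode space characters alone.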


But (reading the regexp documentation), there's
\p{javaWhitespace}      Equivalent to java.lang.Character.isWhitespace()

which appears to do just what's wanted.
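
A rough sketch of how that might look with the JDK alone, no Commons classes needed. The class name is invented, and I am assuming the goal is the same as the normalize() method above. One thing to watch: Character.isWhitespace() is false for the non-breaking spaces (U+00A0, U+2007, U+202F), while Character.isSpaceChar() is true for them, so the sketch also uses \p{javaSpaceChar} to cover both checks.

```java
import java.util.regex.Pattern;

class WhitespaceNormalizer {
    // Runs of characters for which Character.isSpaceChar() is true.
    private static final Pattern SPACE_CHARS =
            Pattern.compile("\\p{javaSpaceChar}+");
    // Runs of characters matching isSpaceChar() or isWhitespace().
    private static final Pattern SPACE_OR_WHITESPACE =
            Pattern.compile("[\\p{javaSpaceChar}\\p{javaWhitespace}]+");
    // Leading or trailing whitespace, as judged by Character.isWhitespace().
    private static final Pattern EDGES =
            Pattern.compile("\\A\\p{javaWhitespace}+|\\p{javaWhitespace}+\\z");

    static String normalize(String in, boolean includeWhitespace) {
        if (in == null) return "";
        Pattern runs = includeWhitespace ? SPACE_OR_WHITESPACE : SPACE_CHARS;
        // Replace each run with one plain space, then strip both ends.
        String collapsed = runs.matcher(in).replaceAll(" ");
        return EDGES.matcher(collapsed).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "  This\u00A0\u2003is\t\na   test ";
        System.out.println("[" + normalize(s, true) + "]");
    }
}
```

With includeWhitespace set to false, tabs and line breaks inside the string are left as they are, which I believe matches the loop version.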

  BugBear
