Re: Unicode script support in Regex and Character class

2010-05-11 Thread Xueming Shen
Ulf Zibis wrote: On 10.05.2010 19:53, Ulf Zibis wrote: On 10.05.2010 03:05, Xueming Shen wrote: Ulf, can you be more specific? I'm not sure I understand your question. What "buffering" are we talking about here? In http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev , I think byte[]

Re: Unicode script support in Regex and Character class

2010-05-11 Thread Ulf Zibis
On 11.05.2010 18:41, Xueming Shen wrote: Ulf Zibis wrote: SOME of my comments below ARE meant for http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev . I marked the others. ;-) - use Arrays.binarySearch() in Character.UnicodeBlock.of(). This one can be discussed in a separate thread,

Re: Unicode script support in Regex and Character class

2010-05-11 Thread Xueming Shen
Ulf Zibis wrote: SOME of my comments below ARE meant for http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev . I marked the others. ;-) - use Arrays.binarySearch() in Character.UnicodeBlock.of(). This one can be discussed in a separate thread; I would prefer to stay with the script supp
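
For readers following the Arrays.binarySearch() suggestion, here is a minimal sketch of a range lookup built on it. The class name, the four-entry table, and the lookup() helper are hypothetical illustrations; this is not the code in the webrev or in Character.UnicodeBlock.of().

```java
import java.util.Arrays;

final class BlockLookupSketch {
    // Hypothetical excerpt: first code point of each block, sorted ascending.
    private static final int[] STARTS = {0x0000, 0x0080, 0x0100, 0x0180};
    private static final String[] NAMES = {
        "BASIC_LATIN", "LATIN_1_SUPPLEMENT", "LATIN_EXTENDED_A", "LATIN_EXTENDED_B"
    };
    // Last code point (inclusive) covered by this excerpt.
    private static final int LAST_END = 0x024F;

    /** Returns the name of the block containing codePoint, or null if it is outside the excerpt. */
    static String lookup(int codePoint) {
        if (codePoint < 0 || codePoint > LAST_END) {
            return null;
        }
        int idx = Arrays.binarySearch(STARTS, codePoint);
        if (idx < 0) {
            // Not an exact block start: binarySearch returned -(insertionPoint) - 1,
            // and the containing block is the one just before the insertion point.
            idx = -idx - 2;
        }
        return NAMES[idx];
    }

    public static void main(String[] args) {
        System.out.println(lookup('A'));
        System.out.println(lookup(0x00FF));
        System.out.println(lookup(0x0250));
    }
}
```

The lookup costs O(log n) comparisons instead of a linear scan over the block table, which is the point of the suggestion.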

Re: Unicode script support in Regex and Character class

2010-05-11 Thread Ulf Zibis
On 10.05.2010 19:53, Ulf Zibis wrote: On 10.05.2010 03:05, Xueming Shen wrote: Ulf, can you be more specific? I'm not sure I understand your question. What "buffering" are we talking about here? In http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev , I think byte[] ba could be avoided

Re: Unicode script support in Regex and Character class

2010-05-11 Thread Ulf Zibis
SOME of my comments below ARE meant for http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev . I marked the others. ;-) -Ulf On 11.05.2010 02:05, Xueming Shen wrote: Ulf, my apologies for distracting you with that "smaller size alternative". As I said in my previous email, please only "re

Re: Unicode script support in Regex and Character class

2010-05-10 Thread Xueming Shen
Ulf, my apologies for distracting you with that "smaller size alternative". As I said in my previous email, please only "review" the bits at http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev . It's fine if you are interested in the stuff I experimented with at http://cr.openjdk.java.net/~sherman/

Re: Unicode script support in Regex and Character class

2010-05-10 Thread Ulf Zibis
Some additional thoughts: - out.writeShort((short)(num & 0x)); ---short form---> out.writeShort((short)num); - use Arrays.binarySearch() in Character.UnicodeBlock.of(). - "if (notFirst)" could be dropped if you first appended the first word to sb outside the while loop. - StringBuilder
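
Two of these suggestions can be sketched in a few lines. The sketch below is illustrative only: the class and method names are hypothetical, the 0xFFFF mask is an assumption (the mask literal is cut off in the archived snippet), and the single-space separator in joinWords() is likewise assumed. It relies on the documented behaviour of DataOutputStream.writeShort(int), which writes only the two low-order bytes of its argument.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ShortFormSketch {
    /** Serializes value with writeShort(int), which keeps only the low 16 bits. */
    static byte[] encode(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeShort(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** True if masking (assumed 0xFFFF), casting, and passing num directly all yield the same bytes. */
    static boolean allFormsAgree(int num) {
        byte[] masked = encode((short) (num & 0xFFFF));
        byte[] cast = encode((short) num);
        byte[] plain = encode(num);
        return Arrays.equals(masked, cast) && Arrays.equals(cast, plain);
    }

    /** Joins words with a space; the first word is appended before the loop, so no flag is needed. */
    static String joinWords(List<String> words) {
        Iterator<String> it = words.iterator();
        if (!it.hasNext()) {
            return "";
        }
        StringBuilder sb = new StringBuilder(it.next());
        while (it.hasNext()) {
            sb.append(' ').append(it.next());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(allFormsAgree(0x12345));
        System.out.println(joinWords(Arrays.asList("LATIN", "SMALL", "LETTER", "A")));
    }
}
```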

Re: Unicode script support in Regex and Character class

2010-05-10 Thread Xueming Shen
Ulf, the stuff under http://cr.openjdk.java.net/~sherman/script/webrev.00 is just an idea about a smaller-size alternative. It is not intended to replace the final bits for review at http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev . My bad, I probably should not have mixed two things in one email.

Re: Unicode script support in Regex and Character class

2010-05-10 Thread Ulf Zibis
On 10.05.2010 03:05, Xueming Shen wrote: Ulf, can you be more specific? I'm not sure I understand your question. What "buffering" are we talking about here? In http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev , I think byte[] ba could be avoided in initNamePool(), as you could directly

Re: Unicode script support in Regex and Character class

2010-05-09 Thread Xueming Shen
Ulf, can you be more specific? I'm not sure I understand your question. What "buffering" are we talking about here? If you are referring to the code below: dis = new DataInputStream(new InflaterInputStream( AccessController.doPrivileged(new PrivilegedAction() { public InputS
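
As context for the stream chain quoted above, here is a self-contained sketch of reading a deflate-compressed table through DataInputStream and InflaterInputStream. The data layout (an int count followed by that many ints), the class name, and the optional BufferedInputStream layer are all assumptions made for illustration; the actual uniName.dat format and the privileged resource lookup from the webrev are not reproduced here.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class CompressedTableSketch {
    /** Writes a count followed by the values, deflate-compressed (hypothetical layout). */
    static byte[] pack(int[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(bytes))) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the layout written by pack(); the BufferedInputStream layer is optional. */
    static int[] unpack(InputStream raw) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new InflaterInputStream(raw)))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] original = {0x0041, 0x03B1, 0x4E00};
        int[] restored = unpack(new ByteArrayInputStream(pack(original)));
        System.out.println(Arrays.equals(original, restored));
    }
}
```

Whether an extra buffering layer pays off is the question debated in the thread; this sketch does not measure it.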

Re: Unicode script support in Regex and Character class

2010-05-09 Thread Ulf Zibis
Sherman, I don't understand why you use so much buffering. The InputStream from getResourceAsStream, and I believe InflaterInputStream too, is already buffered. My understanding until now was that access to buffered byte streams is as fast as access to naked byte arrays. Am I wrong? -Ulf On 08.05.2010

Re: Unicode script support in Regex and Character class

2010-05-08 Thread Xueming Shen
Hi, The API proposals for Unicode script support below have been approved. 6945564: Unicode script support in Character class 6948903: Make Unicode scripts available for use in regular expressions Here is the final webrev ready for push. http://cr.openjdk.java.net/~sherman/6945564_6948903/web

Re: Unicode script support in Regex and Character class

2010-04-30 Thread Xueming Shen
Hi, #4860714 has been closed as a dup (to work around an internal process problem) of my newly created #6948903 for the regex script support. So here are the CCC drafts for 6945564: Unicode script support in Character class 6948903: Make Unicode scripts available for use in regular expressions

Re: Unicode script support in Regex and Character class

2010-04-29 Thread Ulf Zibis
I have corrected the statistics: current code from Sherman: - A Map.Entry object counts 24 bytes (40 on 64-bit machine) - An Integer object for the key counts 12 bytes (20 on 64-bit machine) - A String object counts 36 + 2*length, so for average character name length of 26: 88 bytes (102
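
The 32-bit figures quoted above can be combined into a per-entry estimate. This is a back-of-the-envelope sketch only: the byte counts are the ones stated in the message (not measured here), the class and method names are hypothetical, and the entry count passed to totalBytes() is a placeholder because the snippet does not give one.

```java
final class NameTableFootprintSketch {
    // Figures as quoted in the thread for a 32-bit VM.
    static final int MAP_ENTRY_BYTES = 24;
    static final int INTEGER_KEY_BYTES = 12;

    /** String cost as quoted: 36 bytes of overhead plus 2 bytes per char. */
    static int stringBytes(int length) {
        return 36 + 2 * length;
    }

    /** Map.Entry + Integer key + String value for one character name. */
    static int perEntryBytes(int nameLength) {
        return MAP_ENTRY_BYTES + INTEGER_KEY_BYTES + stringBytes(nameLength);
    }

    /** Total for a hypothetical number of entries with the given average name length. */
    static long totalBytes(int entries, int averageNameLength) {
        return (long) entries * perEntryBytes(averageNameLength);
    }

    public static void main(String[] args) {
        System.out.println(stringBytes(26));
        System.out.println(perEntryBytes(26));
    }
}
```

With the quoted average name length of 26, stringBytes(26) reproduces the 88 bytes given in the message, and one entry comes to 24 + 12 + 88 = 124 bytes.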

Re: Unicode script support in Regex and Character class

2010-04-29 Thread Ulf Zibis
On 24.04.2010 01:09, Xueming Shen wrote: Yes, the final table takes about 500k; we might consider using a weakref or something if memory is really a concern. But the table will get initialized only if you invoke Character.getName(). Sherman, how did you compute that value: - A Map.Entry obj

Re: Unicode script support in Regex and Character class

2010-04-27 Thread Ulf Zibis
On 27.04.2010 19:03, Xueming Shen wrote: Ulf Zibis wrote: I'm wondering, as script.txt only has ~120k. Ulf, you know we are not talking about Unicode script but Unicode character name here, right? Unicode character name data is stored in UnicodeData.txt; you can find it at make/tools/Unic

Re: Unicode script support in Regex and Character class

2010-04-27 Thread Xueming Shen
Ulf Zibis wrote: On 27.04.2010 06:25, Xueming Shen wrote: Ulf Zibis wrote: On 24.04.2010 01:09, Xueming Shen wrote: I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (last version is 122+k), but then I can no longer use cpLen as the capacity for the hash

Re: Unicode script support in Regex and Character class

2010-04-27 Thread Ulf Zibis
Oops, added attachment. -Ulf On 27.04.2010 16:35, Ulf Zibis wrote: On 27.04.2010 06:25, Xueming Shen wrote: Ulf Zibis wrote: On 24.04.2010 01:09, Xueming Shen wrote: I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (last version is 122+k), but th

Re: Unicode script support in Regex and Character class

2010-04-27 Thread Ulf Zibis
On 27.04.2010 06:25, Xueming Shen wrote: Ulf Zibis wrote: On 24.04.2010 01:09, Xueming Shen wrote: I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (last version is 122+k), but then I can no longer use cpLen as the capacity for the hashmap. I'm now usin

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Xueming Shen
Ulf Zibis wrote: I would like to have the 3 special cases INHERITED, COMMON and UNKNOWN together at the beginning or end of the enum list. Why? Since the current list is generated by the script from Scripts.txt, it's in the order in which they appear in Scripts.txt. Any particular reason

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Xueming Shen
Ulf Zibis wrote: On 24.04.2010 01:09, Xueming Shen wrote: I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (last version is 122+k), but then I can no longer use cpLen as the capacity for the hashmap. I'm now using a hardcoded 2 for 5.2. Again, is 88k

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Ulf Zibis
On 24.04.2010 01:09, Xueming Shen wrote: I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (last version is 122+k), but then I can no longer use cpLen as the capacity for the hashmap. I'm now using a hardcoded 2 for 5.2. Again, is 88k the compressed or

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Ulf Zibis
On 27.04.2010 00:01, Xueming Shen wrote: Ulf Zibis wrote: I would like to see the full names redundantly in the aliases map. Needs only ~100 * (4 + 4) bytes in HashMap. This is an implementation detail; we can defer the difference for now. I said that with the alternative of UnicodeScript

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Xueming Shen
Ulf Zibis wrote: On 26.04.2010 07:28, Xueming Shen wrote: Can I assume we are all OK with at least the API part of the latest webrev/blenderrev of the script support in j.l.Character and j.u.r.Pattern, including the j.l.Character.getName()? I guess you mean: public static enum Unicod

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Xueming Shen
Ulf Zibis wrote: On 24.04.2010 01:09, Xueming Shen wrote: Ulf Zibis wrote: - I like the idea of saving the data in a compressed binary file instead of classfile static data. - wouldn't PreHashMaps be initialized faster than normal HashMaps in j.l.Character.UnicodeScript and j.l.CharacterName?

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Ulf Zibis
On 24.04.2010 01:09, Xueming Shen wrote: Ulf Zibis wrote: - I like the idea of saving the data in a compressed binary file instead of classfile static data. - wouldn't PreHashMaps be initialized faster than normal HashMaps in j.l.Character.UnicodeScript and j.l.CharacterName? I don't think so.

Re: Unicode script support in Regex and Character class

2010-04-26 Thread Ulf Zibis
On 26.04.2010 07:28, Xueming Shen wrote: Can I assume we are all OK with at least the API part of the latest webrev/blenderrev of the script support in j.l.Character and j.u.r.Pattern, including the j.l.Character.getName()? I guess you mean: public static enum UnicodeScript {

Re: Unicode script support in Regex and Character class

2010-04-25 Thread Xueming Shen
Can I assume we are all OK with at least the API part of the latest webrev/blenderrev of the script support in j.l.Character and j.u.r.Pattern, including the j.l.Character.getName()? http://cr.openjdk.java.net/~sherman/script/blenderrev.html http://cr.openjdk.java.net/~sherman/script/webrev

Re: Unicode script support in Regex and Character class

2010-04-24 Thread Xueming Shen
Martin Buchholz wrote: Providing script support is obvious and non-controversial, because other regex programming environments provide it. Check that the behavior and syntax of the extension is consistent with e.g. ICU, python, and especially perl (5.12 just released!) http://perldoc.perl.org/pe

Re: Unicode script support in Regex and Character class

2010-04-24 Thread Martin Buchholz
Providing script support is obvious and non-controversial, because other regex programming environments provide it. Check that the behavior and syntax of the extension is consistent with e.g. ICU, python, and especially perl (5.12 just released!) http://perldoc.perl.org/perlunicode.html I would a

Re: Unicode script support in Regex and Character class

2010-04-23 Thread Ulf Zibis
On 24.04.2010 01:09, Xueming Shen wrote: Ulf Zibis wrote: - I like the idea of saving the data in a compressed binary file instead of classfile static data. - wouldn't PreHashMaps be initialized faster than normal HashMaps in j.l.Character.UnicodeScript and j.l.CharacterName? I don't think so.

Re: Unicode script support in Regex and Character class

2010-04-23 Thread Xueming Shen
Ulf Zibis wrote: - I like the idea of saving the data in a compressed binary file instead of classfile static data. - wouldn't PreHashMaps be initialized faster than normal HashMaps in j.l.Character.UnicodeScript and j.l.CharacterName? I don't think so. The key for these 2 cases is the whole unico

Re: Unicode script support in Regex and Character class

2010-04-22 Thread Xueming Shen
Yuri Gaevsky wrote: Hi Sherman, A couple of minor comments: - There is a typo (Uniocde) in Character.UnicodeScript.forName(java.lang.String): "Returns the UnicodeScript with the given Uniocde script name or the script name alias. " - Shouldn't the method be more specific i
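
For orientation, the lookup under discussion can be exercised as below. This sketch uses Character.UnicodeScript.forName(String) as it exists in released JDKs (Java 7 and later), which may differ in detail from the draft reviewed in this thread; the wrapper class and the resolve() helper are hypothetical. In the released API the method accepts full script names and their aliases without regard to case, and throws IllegalArgumentException for an unknown name.

```java
final class ScriptNameSketch {
    /** Returns the script for a name or alias, or null if the name is not recognized. */
    static Character.UnicodeScript resolve(String nameOrAlias) {
        try {
            return Character.UnicodeScript.forName(nameOrAlias);
        } catch (IllegalArgumentException e) {
            // Unknown script name or alias.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("Latin"));
        System.out.println(resolve("Latn"));
        System.out.println(resolve("NoSuchScript"));
    }
}
```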

Re: Unicode script support in Regex and Character class

2010-04-22 Thread Xueming Shen
Ulf Zibis wrote: (3) the syntax for script constructs. In addition to the "normal" \p{InScriptName} and \P{InScriptName} for the script support I'm also adding \p{script=ScriptName} \P{script=ScriptName} for the new script support \p{block=BlockName} \P{block=BlockName} for the "
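
The keyword form of the script construct can be tried out as follows. This sketch relies on java.util.regex.Pattern as released in Java 7 and later, where \p{script=...} and its negation \P{script=...} are supported; the syntax in the draft under review may have differed. The class and method names are hypothetical, and the non-ASCII test input is written with Unicode escapes to avoid source-encoding issues.

```java
import java.util.regex.Pattern;

final class ScriptRegexSketch {
    // One or more characters whose Unicode script is Greek.
    private static final Pattern GREEK_ONLY = Pattern.compile("\\p{script=Greek}+");
    // Any single character whose script is NOT Latin (digits, for example, are COMMON).
    private static final Pattern NON_LATIN = Pattern.compile("\\P{script=Latin}");

    static boolean isAllGreek(String s) {
        return GREEK_ONLY.matcher(s).matches();
    }

    static boolean containsNonLatin(String s) {
        return NON_LATIN.matcher(s).find();
    }

    /** The same classification through the Character API. */
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isAllGreek("\u03b1\u03b2\u03b3"));
        System.out.println(containsNonLatin("ab1"));
        System.out.println(scriptOf('a'));
    }
}
```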

Re: Unicode script support in Regex and Character class

2010-04-22 Thread Ulf Zibis
On 22.04.2010 10:01, Xueming Shen wrote: Hi, Here is the webrev of the proposal to add Unicode script support in regex and j.l.Character. http://cr.openjdk.java.net/~sherman/script/webrev and the corresponding blenderrev http://cr.openjdk.java.net/~sherman/script/blenderrev.html Please c