On Fri, Apr 25, 2014 at 6:05 AM, Steffen Nurpmeso <[email protected]> wrote:

> |What I tried to say is, if you need ID_Start, then parse ID_Start from
> |DerivedCoreProperties.txt. That's more stable (and easier than parsing the
> |pieces and deriving
> |
> |# Lu + Ll + Lt + Lm + Lo + Nl
> |# + Other_ID_Start
> |# - Pattern_Syntax
> |# - Pattern_White_Space
> |
> |yourself).
>
> But i *do* need to parse several many pieces (since i'm hardly
> interested in ID_Start only)!

That's ok. Wherever there is a choice, parse the derived property rather
than parsing the pieces and doing your own derivation.

> So imho it's a bit like «Kraut und Rüben» («higgledy-piggledy»
> says <http://www.dict.cc/?s=Kraut+und+R%C3%BCben>).

I know what that means :-)

> Wouldn't it make sense to introduce a single PropListsJoined.txt
> that does it all.

Depends. You could just parse the files you need. They don't have to be
combined.

I parse most of the UCD .txt files with a Python script and munge them into
one combined file. Then I have C++ code that parses that. (Years ago I did
parse the pieces and derive at runtime, but I found it tedious to follow the
formula changes, and if the data structure eliminates redundancy, then the
data size is about the same.)

Unicode also publishes XML versions of the data, with most or all
properties in a single file. (It's just not as convenient for me to parse
XML in my tools, and the XML files were missing some pieces when I looked at
them.)

You could also just use a library that provides these properties, rather
than roll your own. Shameless plug for ICU here, which has most of the
low-level properties in source code (from a generator), so no data loading
for those. Ask the icu-support list <http://site.icu-project.org/contacts>
for help if needed.

> ..and this is what i would do: offer a new file, say, Formula.txt,
> which defines exactly the necessary formula, e.g., to quote your
> example

It's not "my example". I copied that straight out of
DerivedCoreProperties.txt.
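
To illustrate what "parse the derived property" can look like, here is a
minimal Python sketch. The names parse_property_ranges and has_property are
made up for this example, and SAMPLE is only a short hand-copied excerpt in
the file's "range ; property # comment" layout, not the real data file.

```python
# Hypothetical helper names; SAMPLE is an abbreviated excerpt, not the full file.
SAMPLE = """\
# Derived Property: ID_Start
0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; ID_Start # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; ID_Start # Lo       FEMININE ORDINAL INDICATOR
0030..0039    ; ID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE
"""


def parse_property_ranges(lines, wanted):
    """Collect inclusive (start, end) code point ranges for one property."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point: no ".."
        ranges.append((start, end))
    return ranges


def has_property(ranges, code_point):
    """Linear scan; enough for a sketch, a real table would use bisection."""
    return any(start <= code_point <= end for start, end in ranges)


id_start = parse_property_ranges(SAMPLE.splitlines(), "ID_Start")
print(has_property(id_start, ord("a")))  # letter inside a listed range
print(has_property(id_start, ord("7")))  # digit, only listed under ID_Continue
```

For real data, one would presumably pass an open file object for the
downloaded DerivedCoreProperties.txt instead of SAMPLE.splitlines(); no
formula has to be evaluated at all.
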
It's not worth writing a parser that handles all formulas (they are meant
for human consumption) and derives the properties from them when you can
just parse the derived property values.

> I don't know why there need to be megabytes of duplicated data.

It's easier to maintain the data in pieces, although we have to check the
derived results as well. For implementers, the derived properties are the
way to go.

> Ach; and i'm not gonna start to dream of better support for ISO
> C / POSIX character classes. (Oh. ...It's surely sapless.)

http://www.unicode.org/reports/tr18/#Compatibility_Properties

Best regards,
markus
--
Google Internationalization Engineering
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

