On Tue, Nov 5, 2013 at 5:38 AM, Steffen Daode <[email protected]> wrote:
> Hello,
> ...i came to this solution in order to generate test data with
> awk(1) in a memory-friendly way?
> Comments like at the end of this line?
>
>   0009..000D ; White_space # Cc [5] <control>..<control>
>
> (The problem i'm facing is that _PRINT and _GRAPH cannot be set
> for some properties from PropList.txt, say, _PRINT can't be set
> for U+0009, CHARACTER TABULATION (ht), since it's a Cc, but in
> order to know that i had to parse UnicodeData.txt and store
> character information in memory first, (not thinking about further
> options), but that requires a lot of memory, more than is
> available on low-end machines.)
>

The comments are just that, comments, for human consumption, and their
format may change without notice. One exception is the syntax in the
@missing lines.

It is normal that you need to parse multiple Unicode data files for
extracting useful data. It also does not require "a lot of memory"
considering how much memory is available even on ten-year-old clunkers
at this point, unless you are especially extravagant with how you store
the data. Besides, after parsing, you would normally build more compact
data structures for the data you need.

Having said that, if your parsing works with the files you see and the
data you want to extract, then go for it. Just make sure that if the
format changes, you have enough checks in your parser so that it fails
with an error rather than silently producing garbage. You should also
spot-check that the data you get from the comments does indeed match
the real data.

markus
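
To illustrate the two checks suggested above (fail with an error on an
unexpected format, and spot-check comment data against the real data),
here is a minimal sketch in Python rather than awk. The names
parse_line and spot_check, the regular expressions, and the real_gc
mapping (code point to General_Category, as one might load from
UnicodeData.txt) are all assumptions made for this example; the comment
layout it expects is only what the quoted sample line shows, not a
guaranteed format.

```python
import re

# Data part of a PropList.txt-style line, e.g. "0009..000D ; White_Space".
_DATA_RE = re.compile(
    r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)$"
)
# Assumed comment layout: a two-character category first, e.g. "Cc [5] ...".
_COMMENT_GC_RE = re.compile(r"^([A-Z][a-z&])(?=\s|$)")


def parse_line(line):
    """Return (first, last, prop, comment_gc), or None if the line has no data.

    Raises ValueError when the data part does not match the expected
    format, so a format change is reported instead of silently ignored.
    """
    data, _, comment = line.partition("#")
    data = data.strip()
    if not data:
        return None  # blank line or comment-only line
    m = _DATA_RE.match(data)
    if m is None:
        raise ValueError(f"unrecognized line format: {line!r}")
    first = int(m.group(1), 16)
    last = int(m.group(2), 16) if m.group(2) else first
    if last < first or last > 0x10FFFF:
        raise ValueError(f"bad code point range: {line!r}")
    gc = _COMMENT_GC_RE.match(comment.strip())
    return first, last, m.group(3), gc.group(1) if gc else None


def spot_check(entry, real_gc):
    """Compare the category read from the comment with known real values.

    real_gc is a hypothetical dict {code point: General_Category}; only
    code points present in it are checked.
    """
    first, last, _prop, comment_gc = entry
    for cp in range(first, last + 1):
        if cp not in real_gc:
            continue
        actual = real_gc[cp]
        # "L&" in comments is shorthand for the cased letters Lu, Ll, Lt.
        ok = actual in ("Lu", "Ll", "Lt") if comment_gc == "L&" else actual == comment_gc
        if not ok:
            raise ValueError(
                f"U+{cp:04X}: comment says {comment_gc!r}, data says {actual!r}"
            )
```

Calling parse_line on every line of the file and spot_check on a small
sample of the results keeps memory use low, since only the sampled code
points need real category data in memory.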

