On 01/02/2016 10:48 AM, Ryan Hiebert wrote:
> I'm completely new to USFM, so I'm sketching out my ideas on how the parser probably should look. This is unearthing some of the many things I don't understand about USFM, so I'll post my questions here. Feel free to forward me to a better forum if there is one.
>
> These questions are all related:
>
> 1. Is text allowed to be on a line _without_ a marker starting the line?

Yes. Newline and space are equivalent.

> 2. Are blank lines semantically meaningful?

No.

> That is, if all the blank lines are removed, does the file mean _exactly_ the same thing?

Yes. Two or more consecutive whitespace characters are the same as one. A whitespace character can be a space, tab, or newline.

No.

> 3. Are the non-text markers (ones that don't have the ending form, \usfm*) required at the beginning of all meaningful lines?

Note that there are FOUR classes of markers, not just two, the way I parse them (which is regularly tested against Paratext output):

1: Starts at the beginning of a line (normally), and indicates a paragraph or metadata. Its effects extend until the next such marker. Examples: \id, \c, \v, \p, \q1.

2: Footnote/cross reference styles, non-nestable, terminated by the next such style. For historical reasons, these can also be terminated by an end marker like the next case, so when reading, allow either syntax. Examples: \fr, \ft or \ft ...\ft*, \fqa.

3: Normal character markers with both beginning and ending markers, where the end marker is the same as the beginning marker with "*" appended. These cannot be nested inside one another or inside the style markers above. Examples: \nd ...\nd*, \wj ...\wj*.

4: Nested character markers start with "\+" and terminate with the same marker followed by "*". These are otherwise the same as #3, but cannot occur unless they are inside a style of case #2 or #3. Examples: \+nd ...\+nd*, \+wj ...\+wj*

The class numbers above aren't in the USFM specification, but the concepts are there, and also in the master reference implementation of USFM, which is Paratext.
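To make the whitespace rule and the four marker classes concrete, here is a minimal Python sketch of a tokenizer. It is only an illustration under stated assumptions, not how Paratext or Haiola actually does it: the marker sets contain only the examples named above (a real parser would need the full lists from the USFM specification), and the names `tokenize`, `classify`, and `TOKEN_RE` are made up for this example.

```python
import re

# Only the example markers from the message; these sets are an assumption
# and are far from complete.
PARAGRAPH_MARKERS = {"id", "c", "v", "p", "q1"}  # class 1
NOTE_MARKERS = {"fr", "ft", "fqa"}               # class 2
CHARACTER_MARKERS = {"nd", "wj"}                 # class 3, or class 4 with "\+"

# Either a marker (backslash, optional "+", name, optional "*")
# or a run of text containing no backslash.
TOKEN_RE = re.compile(r"\\(\+?)([A-Za-z]+[0-9]*)(\*?)|([^\\]+)")


def classify(name, nested=False):
    """Return the class number (1-4) for a marker name, or None if unknown."""
    if nested:
        return 4 if name in CHARACTER_MARKERS else None
    if name in PARAGRAPH_MARKERS:
        return 1
    if name in NOTE_MARKERS:
        return 2
    if name in CHARACTER_MARKERS:
        return 3
    return None


def tokenize(usfm):
    """Split USFM text into ("start"|"end", name, class) and ("text", s) tuples."""
    # Space, tab, and newline are equivalent, and runs collapse to one space.
    normalized = re.sub(r"\s+", " ", usfm).strip()
    tokens = []
    for match in TOKEN_RE.finditer(normalized):
        plus, name, star, text = match.groups()
        if text is not None:
            # Simplification: edge spaces are dropped, so spacing right next
            # to a character marker is not preserved by this sketch.
            text = text.strip()
            if text:
                tokens.append(("text", text))
        else:
            kind = "end" if star else "start"
            tokens.append((kind, name, classify(name, nested=bool(plus))))
    return tokens
```

Because whitespace is normalized before tokenizing, `tokenize("\\v 1 In the\nbeginning")` and `tokenize("\\v 1 In   the \n\n beginning")` give the same token list, and a class #1 marker is recognized wherever it appears, not only at the start of a line.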
Sometimes Paratext produces USFM files where markers of the first kind appear in positions other than the beginning of a line. When writing USFM, put them at the beginning of a line. When reading USFM, be more tolerant.

No, but when writing class #1 markers in USFM, put them at the beginning of a line.

> 4. Is only one non-text marker allowed per line?
> 5. Must a non-text marker be only at the beginning of a line?

That is best practice. Always write them there if you are writing, but allow them elsewhere if you are reading.

> Thanks for any help you can give with assisting me in sorting this out. I'm obviously completely new to USFM, so I don't know what I don't know.

Take a look at some test cases from http://ebible.org/Scriptures/, files ending in _usfm.zip. Also, if you want to read some C# code, you can check out the Haiola source code for how I parse USFM.

Also, one word of caution: there is no proper way to do a one-to-one, lossless, round-trippable correspondence between USFM and OSIS.

--
Aloha,
_______________________________________________ sword-devel mailing list: sword-devel@crosswire.org http://www.crosswire.org/mailman/listinfo/sword-devel Instructions to unsubscribe/change your settings at above page