Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
When an apostrophe is used to make an English noun possessive, if the noun already ends with the letter s, the apostrophe is placed after the s. There are even some rare exceptions, such as the singular noun cockatrice, in which the possessive just has an apostrophe but no letter s afterwards. Hint: search the KJV module.

One method I’ve used in the past on some projects is to temporarily replace the apostrophe used to make a possessive with an unused letter from Latin-1 Supplement, such as U+00FE Latin small letter thorn. This is used in Old English, Icelandic and phonetics, but not in modern English or Early Modern English.

You have to know the rules for possessives, and you may still have instances where a closing single quotation mark could be mistaken for a possessive apostrophe unless the script takes account of the wider context.

Another use for the apostrophe is to mark a missing syllable in a longer word. This is less likely to occur in Bibles, but it might occur in some formal texts.

And don’t get me started on works that were first digitised before the use of Unicode.

Best regards,
David
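The placeholder technique described in the message above can be sketched roughly as follows. This is only an illustration, not the code David used: the function names are hypothetical, and it assumes apostrophes are encoded as U+2019. It hides only apostrophes that have a letter on both sides, so a trailing possessive such as “boys’” or “cockatrice’” is left alone and would still need the wider context mentioned above.

```python
import re

APOSTROPHE = "\u2019"   # ’ RIGHT SINGLE QUOTATION MARK (assumed encoding)
PLACEHOLDER = "\u00fe"  # þ LATIN SMALL LETTER THORN

# An apostrophe with a letter on both sides (boy’s, don’t) cannot be a
# closing quotation mark, so it is safe to swap out.
_INNER = re.compile(rf"(?<=[^\W\d_]){APOSTROPHE}(?=[^\W\d_])")


def hide_apostrophes(text):
    """Replace word-internal apostrophes with the placeholder letter."""
    if PLACEHOLDER in text:
        # The trick only works if the placeholder is truly unused.
        raise ValueError("placeholder already occurs in the text")
    return _INNER.sub(PLACEHOLDER, text)


def restore_apostrophes(text):
    """Undo hide_apostrophes() once the quote check has run."""
    return text.replace(PLACEHOLDER, APOSTROPHE)
```

A quote checker would run on the output of `hide_apostrophes()`, and `restore_apostrophes()` would be applied before anything is written back.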
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
On 2023-12-19 04:26, Matěj Cepl wrote:

> On Tue Dec 19, 2023 at 2:17 AM CET, Timothy Allen wrote:
>> 2. Apostrophes
>>
>> In English, the apostrophe used for possession (“the boy’s train”) and omission (“don’t let’s start”) is traditionally set with the same character used as the closing single quote, so in any non-trivial document there will almost certainly be more "closing single quotes" than opening single quotes; it's not worth reporting on.
>
> Yes, I am aware of it, and I feel very blessed that I don’t have this problem in Czech. I have no idea what to do with this without proper syntactic analysis, which is out of the question. Perhaps running `re.sub(r'’s\b', '@#s', whole_text)` and then back, but it seems like a recipe for disaster.

I think a better solution would be to make the script itself aware of whether a closing single quote is acting as a closing quote or not. If the closing single quote is followed by an alphabetic character (it should be able to test Unicode character classes for this), then it should be treated as an apostrophe instead. I don’t know if biblical texts generally use contractions, but your regular expression doesn’t handle contractions generally. Also, I only know English and I am quite possibly missing some edge cases. Some examples:

* This isn’t a closing quote. (‘t’ is an alphabetic character)
* “I said, ‘This is a closing quote within a double-quoted phrase’”. (‘”’ isn’t an alphabetic character)

>> 3. Nested quotations
>>
>> In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell other people that she was Abraham’s sister. In the BSB (and NIV, and ESV, and NASB) this results in a triple-nested quotation. In English typesetting conventions the outermost quotation gets double quotes, the second level gets single quotes, and the third level gets double quotes again. This causes the script to report an error:
>>
>> I couldn't immediately think of a way to get around this.
>
> Me neither. We should probably make an effort at error recovery, so that the script would continue even after reporting a problem, but I am not sure how to do that either.

The other approach would be checking what the counts are upon reaching a terminating section. As mentioned below, in English, all quotes are implicitly closed by the end of a paragraph. So any nonzero counts at the end of a paragraph are OK. But when you encounter a closing quote, you can make sure that the last opening quote is the same type of quote. If you store the opening quote type in a stack, pop whenever you encounter a closing quote while confirming a match, report an error upon trying to pop an empty stack or encountering a mismatched quote, and clear the stack upon reaching a paragraph end, that would provide something useful for English.

>> Another quirk that occurs to me is that in English typesetting, if one person speaks multiple paragraphs (for example, the Sermon on the Mount) then each paragraph gets an opening double quote, but no closing double quote. That's going to play havoc with this kind of quote-checking tool, too.
>
> Yes, we don’t do this in Czech, but it is typographically possible to just use paragraph indentation instead of quoting, and of course we don’t have anything like indentation in the pure XML. I have just added quotes in the appropriate places and plan to send the patch to the Czech Biblical Society (after David reviews my fixes in https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2) with some other clear bugs I have found.

See above. Unfortunately, it sounds like English speakers would want the script to be aware of different rules per language, which definitely complicates things. But that would increase its utility in automatically identifying likely transcription errors.

___
sword-devel mailing list: sword-devel@crosswire.org
http://crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
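The two ideas in the message above (treat a closing single quote followed by a letter as an apostrophe, and track opening quotes on a stack that is cleared at each paragraph end) can be combined in a short sketch. It is an illustration under stated assumptions, not the script under discussion: `check_quotes` and `PAIRS` are hypothetical names, paragraphs are assumed to be separated by blank lines, and only English curly quotes are handled.

```python
import re

# Opening mark -> matching closing mark, for English curly quotes.
PAIRS = {"\u201c": "\u201d", "\u2018": "\u2019"}  # “…” and ‘…’
CLOSERS = set(PAIRS.values())


def check_quotes(text):
    """Return a list of problems found; an empty list means none."""
    problems = []
    for number, paragraph in enumerate(re.split(r"\n\s*\n", text), start=1):
        stack = []  # opening marks still waiting for their closer
        for index, char in enumerate(paragraph):
            if char in PAIRS:
                stack.append(char)
            elif char in CLOSERS:
                following = paragraph[index + 1 : index + 2]
                if char == "\u2019" and following.isalpha():
                    continue  # followed by a letter: treat as an apostrophe
                if not stack:
                    problems.append(f"paragraph {number}: {char} closes nothing")
                elif PAIRS[stack[-1]] != char:
                    problems.append(
                        f"paragraph {number}: {char} does not match {stack[-1]}"
                    )
                else:
                    stack.pop()
        # Quotes still open at the paragraph end count as implicitly
        # closed, so the stack is simply discarded here.
    return problems
```

Because problems are collected rather than raised, the check keeps going after the first report. On a mismatch the stack is left untouched, which is one possible choice among several. A plural possessive such as “the boys’ train” outside any quotation would still be reported, so the letter test does not remove every false positive.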
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
On Tue Dec 19, 2023 at 1:30 AM CET, Timothy Allen wrote:
> As a data point, when I was writing scripts for manipulating and updating the BSB module, I found the `xml.etree.ElementTree` module in the Python standard library to be many times faster than the SAX API. The SAX API is perhaps a bit more convenient, because you can just subscribe to whatever events are meaningful for whatever processing you want to do, but ElementTree is just so much faster I found it was worth it.

I have heard good things about XMLPullParser, but I have never tried to use it in anger. I am not sure how plain ElementTree (and I suppose you mean its `findall()` method) could help me here. My main focus is on the SAX `characters()` method, and here ElementTree with its `.text` and `.tail` attributes doesn’t help much, although if there were a `findalltext()` method, it could get interesting.

> LXML is probably faster again, but that's a third-party dependency, and that adds enough hassle for people who aren't Python developers that I drew the line there.

Certainly, I am a big fan of writing just within the confines of the Python standard library.

> If you've already written things using the SAX API that work well for you, there's probably no point rewriting them, but if you're writing more tools in the future, you might want to give it a try!

I know ElementTree and it is very useful for simple searches in the XML tree, but I am not sure how it would help with this project or with my CzeCSP conversion.

Blessings,

Matěj
--
http://matej.ceplovi.cz/blog/, @mcepl@floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8

“Anything essential is invisible to the eyes”, the little prince repeated, in order to remember.
“It’s the time you spent on your rose that makes your rose so important.”
“It’s the time I spent on my rose …,” the little prince repeated, in order to remember.
“People have forgotten this truth.” the fox said. “But you mustn’t forget it. You become responsible forever for what you’ve tamed. You’re responsible for your rose…”
“I’m responsible for my rose…,” the little prince repeated, in order to remember.
-- Antoine de Saint-Exupéry: The Little Prince
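For readers comparing the two APIs discussed above: the standard library's `Element.itertext()` yields the text of an element and all its descendants in document order, which may be close to the `findalltext()` wished for in the message. The sketch below collects the same text once through a SAX `characters()` handler and once through `itertext()`. The sample markup and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
import xml.sax

# Made-up sample with inline markup, so .text and .tail both matter.
SAMPLE = "<p>In the <b>beginning</b> God <i>created</i> the heaven.</p>"


class TextCollector(xml.sax.ContentHandler):
    """Collect every chunk passed to characters()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def text_via_sax(document):
    handler = TextCollector()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return "".join(handler.chunks)


def text_via_elementtree(document):
    # itertext() walks the .text and .tail of every descendant in
    # document order, so no manual bookkeeping is needed.
    return "".join(ET.fromstring(document).itertext())
```

Both helpers return the same string for `SAMPLE`. What `itertext()` does not give is the per-event context (for example the current verse) that a SAX handler can track as it goes.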
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
On Tue Dec 19, 2023 at 2:17 AM CET, Timothy Allen wrote:
> I tried running it over my BSB module, and I hit problems fairly quickly, some of which are more easily solved than others.
>
> 1. No support for language “en”
>
> This was easy enough to handle, there's a configuration variable near the top of the file that lets you configure which quotes are used for which languages.

A patch sent to my email would be welcome.

> 2. Apostrophes
>
> In English, the apostrophe used for possession (“the boy’s train”) and omission (“don’t let’s start”) is traditionally set with the same character used as the closing single quote, so in any non-trivial document there will almost certainly be more "closing single quotes" than opening single quotes; it's not worth reporting on.

Yes, I am aware of it, and I feel very blessed that I don’t have this problem in Czech. I have no idea what to do with this without proper syntactic analysis, which is out of the question. Perhaps running `re.sub(r'’s\b', '@#s', whole_text)` and then back, but it seems like a recipe for disaster.

> 3. Nested quotations
>
> In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell other people that she was Abraham’s sister. In the BSB (and NIV, and ESV, and NASB) this results in a triple-nested quotation. In English typesetting conventions the outermost quotation gets double quotes, the second level gets single quotes, and the third level gets double quotes again. This causes the script to report an error:
>
> I couldn't immediately think of a way to get around this.

Me neither. We should probably make an effort at error recovery, so that the script would continue even after reporting a problem, but I am not sure how to do that either.

> Another quirk that occurs to me is that in English typesetting, if one person speaks multiple paragraphs (for example, the Sermon on the Mount) then each paragraph gets an opening double quote, but no closing double quote. That's going to play havoc with this kind of quote-checking tool, too.

Yes, we don’t do this in Czech, but it is typographically possible to just use paragraph indentation instead of quoting, and of course we don’t have anything like indentation in the pure XML. I have just added quotes in the appropriate places and plan to send the patch to the Czech Biblical Society (after David reviews my fixes in https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2) with some other clear bugs I have found.

> Perhaps this kind of tool just isn't suited to checking English text... but I'm sure there's other languages with more sensible conventions that it could help with. Good luck with it!

With https://gitlab.com/crosswire-bible-society/CzeCEP/-/merge_requests/4/diffs I have managed to make CzeCEP behave. Now I will try other Czech modules.

Blessings,

Matěj
--
http://matej.ceplovi.cz/blog/, @mcepl@floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8

Power tends to corrupt and absolute power corrupts absolutely. Great men are almost always bad men, […]
-- Lord Acton (including the more important part of the often misquoted statement)
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
On 19/12/23 00:06, Matěj Cepl wrote:
> I have decided not to rely on the very kind help of David with his Windows tools, and I have written a (hopefully) completely platform-neutral pure Python 3 script for checking pairwise characters. So far it has been used only for fixing https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2 and I am quite sure it is pretty buggy, but it could prove useful for somebody.

Thank you for doing this work! This seems like it could be a useful tool for validating texts of all kinds.

I tried running it over my BSB module, and I hit problems fairly quickly, some of which are more easily solved than others.

1. No support for language “en”

This was easy enough to handle, there's a configuration variable near the top of the file that lets you configure which quotes are used for which languages.

2. Apostrophes

In English, the apostrophe used for possession (“the boy’s train”) and omission (“don’t let’s start”) is traditionally set with the same character used as the closing single quote, so in any non-trivial document there will almost certainly be more "closing single quotes" than opening single quotes; it's not worth reporting on. I got around this by just deleting single quotes from the configuration.

3. Nested quotations

In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell other people that she was Abraham’s sister. In the BSB (and NIV, and ESV, and NASB) this results in a triple-nested quotation. In English typesetting conventions the outermost quotation gets double quotes, the second level gets single quotes, and the third level gets double quotes again. This causes the script to report an error:

    Balance for character “ is over one in Gen.20.13

I couldn't immediately think of a way to get around this.

Another quirk that occurs to me is that in English typesetting, if one person speaks multiple paragraphs (for example, the Sermon on the Mount) then each paragraph gets an opening double quote, but no closing double quote. That's going to play havoc with this kind of quote-checking tool, too.

Perhaps this kind of tool just isn't suited to checking English text... but I'm sure there's other languages with more sensible conventions that it could help with. Good luck with it!

Timothy.
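To make the configuration point and the apostrophe problem concrete, here is a small sketch of a per-language quote table and a plain balance count. The table layout and the names `QUOTES` and `balances` are assumptions for illustration; the actual configuration variable in the script may look quite different.

```python
# Hypothetical per-language table: opening mark -> closing mark.
QUOTES = {
    "cs": {"\u201e": "\u201c", "\u201a": "\u2018"},  # „…“ and ‚…‘
    "en": {"\u201c": "\u201d", "\u2018": "\u2019"},  # “…” and ‘…’
}


def balances(text, pairs):
    """Return opening-minus-closing counts, keyed by the opening mark."""
    return {
        opener: text.count(opener) - text.count(closer)
        for opener, closer in pairs.items()
    }


sample = "\u201cThe boy\u2019s train\u201d"  # “The boy’s train”
with_singles = balances(sample, QUOTES["en"])
# The apostrophe is counted as an unmatched closing single quote.
doubles_only = balances(sample, {"\u201c": "\u201d"})
```

Here `with_singles` reports a balance of -1 for the single-quote pair even though the text is fine, which is the false positive described above. Dropping the single-quote pair, as in `doubles_only`, is the workaround Timothy mentions. The table also shows why the rules cannot be shared across languages: U+201C opens a quotation in English but closes one in Czech.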
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
On 19/12/23 01:45, Matěj Cepl wrote:
> 2. I use SAX API (xml.sax from the standard library) and it seems to me better suited for Bible processing than the traditional DOM (or LXML) interface. It nicely hides away all the hard work going on in the background and lets me work only on what’s relevant to my task.

As a data point, when I was writing scripts for manipulating and updating the BSB module, I found the `xml.etree.ElementTree` module in the Python standard library to be many times faster than the SAX API. The SAX API is perhaps a bit more convenient, because you can just subscribe to whatever events are meaningful for whatever processing you want to do, but ElementTree is just so much faster I found it was worth it.

LXML is probably faster again, but that's a third-party dependency, and that adds enough hassle for people who aren't Python developers that I drew the line there.

If you've already written things using the SAX API that work well for you, there's probably no point rewriting them, but if you're writing more tools in the future, you might want to give it a try!

Timothy
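Anyone wanting to check the speed difference on their own machine could start from a rough harness like the one below. It is only a sketch: the document is synthetic rather than a real module, the function names are invented, and the numbers it prints will depend on the Python version and on the input, so it should not be read as confirming or refuting the claim above.

```python
import timeit
import xml.etree.ElementTree as ET
import xml.sax

# Synthetic stand-in for a module file; real OSIS input will behave differently.
VERSE = "<verse>In the <b>beginning</b> God created.</verse>"
DOCUMENT = ("<osis>" + VERSE * 2000 + "</osis>").encode("utf-8")


class CountingHandler(xml.sax.ContentHandler):
    """Add up the length of all character data."""

    def __init__(self):
        super().__init__()
        self.length = 0

    def characters(self, content):
        self.length += len(content)


def sax_length():
    handler = CountingHandler()
    xml.sax.parseString(DOCUMENT, handler)
    return handler.length


def etree_length():
    return sum(len(chunk) for chunk in ET.fromstring(DOCUMENT).itertext())


if __name__ == "__main__":
    # Same work through both APIs, timed over 20 runs each.
    for function in (sax_length, etree_length):
        seconds = timeit.timeit(function, number=20)
        print(f"{function.__name__}: {seconds:.3f} s")
```

Both functions do the same job (total length of all text), so any timing difference comes from the parsing API rather than from the processing.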
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
OK, all good then, we are covered; this is a different use case.
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
On Mon Dec 18, 2023 at 2:38 PM CET, Kristof Szabo wrote:
> I wrote some time back https://github.com/krisek/sword-test, with quite a few test cases, which, I think, covers your use case as well.

A couple of differences at first look:

1. Functionally, I prefer my script, which stops when the first unpaired character is found, thus allowing the problem to be fixed.
2. I use SAX API (xml.sax from the standard library) and it seems to me better suited for Bible processing than the traditional DOM (or LXML) interface. It nicely hides away all the hard work going on in the background and lets me work only on what’s relevant to my task. See https://gitlab.com/crosswire-bible-society/CzeCSP/-/blob/master/CEPtoOSIS.py for an example of much more complicated processing (and also, it is ten-fold or something like that faster than processing with Java and Saxon/XSLT).

> > Temporarily the script is in its own repo (https://gitlab.com/mcepl/bible-freq-counter) and attached to this message, but I would like to submit it to sword-utils. How to do it?

Just an update … I have moved the script to https://git.crosswire.org/mcepl/bible-freq-counter.

Best,

Matěj
--
http://matej.ceplovi.cz/blog/, @mcepl@floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8

Nemo plus iuris ad alium transferre potest quam ipse habet.
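The stop-at-the-first-problem behaviour combined with SAX, as described in the two points above, might look roughly like the following. This is a sketch under assumptions, not the actual script: it assumes container-style `<verse osisID="…">` elements, checks only the Czech „…“ pair, and all class and function names are invented.

```python
import xml.sax

PAIRS = {"\u201e": "\u201c"}  # Czech „…“ only, to keep the sketch short
CLOSERS = {closer: opener for opener, closer in PAIRS.items()}


class UnpairedCharacter(Exception):
    """Raised at the first mark that breaks the balance."""


class PairChecker(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.verse = None  # osisID of the verse currently being read
        self.depth = {opener: 0 for opener in PAIRS}

    def startElement(self, name, attrs):
        if name == "verse" and "osisID" in attrs:
            self.verse = attrs["osisID"]

    def characters(self, content):
        for char in content:
            if char in PAIRS:
                self.depth[char] += 1
            elif char in CLOSERS:
                opener = CLOSERS[char]
                self.depth[opener] -= 1
                if self.depth[opener] < 0:
                    # Raising here aborts the parse at the first problem.
                    raise UnpairedCharacter(
                        f"{char} without {opener} in {self.verse}"
                    )


def first_problem(document):
    """Return a message for the first unpaired closing mark, or None."""
    try:
        xml.sax.parseString(document.encode("utf-8"), PairChecker())
    except UnpairedCharacter as error:
        return str(error)
    return None
```

The handler only needs `startElement()` to remember where it is and `characters()` to do the counting, and everything else is left to the parser. Marks that are opened but never closed are not reported in this sketch. That would need an extra check, for example in `endDocument()`.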
Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)
Hi Matěj,

I wrote some time back https://github.com/krisek/sword-test, with quite a few test cases, which, I think, covers your use case as well. I was in touch with Dom on this at the time, but somehow the discussion about how to include these in the module build pipeline stopped.

If you think it is useful I can check your code in detail, and we can see how we can include it in sword-test.

I think the purpose of sword-utils is something else, and I would be surprised if Python code was incorporated there. (If that was the case, I'd give it a go with sword-test too :))

Kind regards,
Kristof

On Mon, Dec 18, 2023 at 2:07 PM Matěj Cepl wrote:
> Hello,
>
> I have decided not to rely on the very kind help of David with his Windows tools, and I have written a (hopefully) completely platform-neutral pure Python 3 script for checking pairwise characters. So far it has been used only for fixing https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2 and I am quite sure it is pretty buggy, but it could prove useful for somebody.
>
> Temporarily the script is in its own repo (https://gitlab.com/mcepl/bible-freq-counter) and attached to this message, but I would like to submit it to sword-utils. How to do it?
>
> Blessings,
>
> Matěj
> --
> http://matej.ceplovi.cz/blog/, @mcepl@floss.social
> GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8
>
> Afraid to die alone?
> Become a bus driver.
> -- alleged easter egg in notepad++