Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-19 Thread David Haslam

When an apostrophe is used to make an English noun a possessive, if the noun 
already ends with the letter s - the apostrophe is placed after the s.

There are even some rare exceptions such as the singular noun cockatrice - in 
which the possessive just has an apostrophe but no letter s afterwards. Hint: 
search the KJV module.

One method I’ve used in the past on some projects is to temporarily replace the 
apostrophe used to make a possessive by an unused letter from Latin-1 
Supplement such as U+00FE Latin small letter Thorn.

Thorn is used in Old English, Icelandic and phonetic notation, but not in
modern English or Early Modern English.
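
A minimal sketch of the placeholder trick described above, in Python since that is the language of the script under discussion. The names `mask_possessives` and `unmask_possessives` and the regular expression are illustrative assumptions, not code from any project mentioned in this thread. It masks only the unambiguous case (a letter, the apostrophe, then a word-final "s"); a plural possessive such as "boys’" is left alone because it cannot be told apart from a closing quote without wider context.

```python
import re

THORN = "\u00fe"       # U+00FE LATIN SMALL LETTER THORN, assumed absent from the text
APOSTROPHE = "\u2019"  # U+2019, the same character as the closing single quote

# A letter, then the apostrophe, then an "s" that ends the word (boy’s, king’s, let’s).
POSSESSIVE = re.compile(r"(?<=[^\W\d_])" + APOSTROPHE + r"(?=s\b)")


def mask_possessives(text: str) -> str:
    """Temporarily replace apostrophes in word-final ’s with the placeholder."""
    if THORN in text:
        raise ValueError("placeholder U+00FE already occurs in the text")
    return POSSESSIVE.sub(THORN, text)


def unmask_possessives(text: str) -> str:
    """Restore the apostrophes after the quote check has run."""
    return text.replace(THORN, APOSTROPHE)
```

The check for an existing thorn matters: if the text already contains U+00FE, restoring would turn genuine thorns into apostrophes.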

You have to know the rules for possessives - and you may still have instances 
where a closing single quotation mark could be mistaken for a possessive 
apostrophe unless the script takes account of the wider context.

Another use for the apostrophe is to mark a missing syllable in a longer word. 
This is less likely to occur in Bibles but it might occur in some formal texts.

And don’t get me started on works that were first digitised before the use of 
Unicode.

Best regards,

David

Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-19 Thread Nathan Phillip Brink


On 2023-12-19 04:26, Matěj Cepl wrote:

> On Tue Dec 19, 2023 at 2:17 AM CET, Timothy Allen wrote:
>
>> 2. Apostrophes
>>
>> In English, the apostrophe used for possession (“the boy’s train”) and
>> omission (“don’t let’s start”) is traditionally set with the same
>> character used as the closing single quote, so in any non-trivial
>> document there will almost certainly be more "closing single quotes"
>> than opening single quotes; it's not worth reporting on.
>
> Yes, I am aware of it, and I feel very blessed that I don’t
> have this problem in Czech. I have no idea what to do with
> this without proper syntactic analysis, which is out of the
> question. Perhaps running `re.sub(r'’s\b', '@#s', whole_text)`
> and then reversing it afterwards, but it seems like a recipe for disaster.


I think a better solution would be to make the script itself aware of 
when a closing single quote is acting as a closing quote or not. If the 
closing single quote is followed by an alphabetic character (it should 
be able to test Unicode character classes for this), then it should be 
treated as an apostrophe instead. I don’t know if biblical texts 
generally use contractions, but your regular expression doesn’t handle 
contractions generally. Also, I only know English and I am quite 
possibly missing some edge cases. Some examples:


 * This isn’t a closing quote. (‘t’ is an alphabetic character)
 * “I said, ‘This is a closing quote within a double-quoted phrase’”.
   (‘”’ isn’t an alphabetic character)
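
A small sketch of the rule proposed above, assuming the closing single quote is U+2019 and using Python's `str.isalpha()` as the Unicode "alphabetic" test. The function names are made up for illustration. As noted, this is a heuristic: a plural possessive such as "boys’ train" is followed by a space, so it would still be counted as a closing quote.

```python
RSQUO = "\u2019"  # RIGHT SINGLE QUOTATION MARK, also used as the apostrophe


def is_apostrophe(text: str, i: int) -> bool:
    """True if text[i] is U+2019 and the next character is alphabetic."""
    return text[i] == RSQUO and i + 1 < len(text) and text[i + 1].isalpha()


def closing_single_quotes(text: str) -> list:
    """Indexes of U+2019 characters that are treated as real closing quotes."""
    return [
        i for i, ch in enumerate(text)
        if ch == RSQUO and not is_apostrophe(text, i)
    ]
```

Applied to the two examples above, the first sentence yields no closing quotes and the second yields exactly one.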


>> 3. Nested quotations
>>
>> In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell
>> other people that she was Abraham’s sister. In the BSB (and NIV, and
>> ESV, and NASB) this results in a triple-nested quotation. In English
>> typesetting conventions the outermost quotation gets double-quotes, the
>> second level gets single-quotes, and the third level gets double quotes
>> again. This causes the script to report an error:
>>
>> I couldn't immediately think of a way to get around this.
>
> Me neither. We should probably make an effort at error recovery, so
> that the script would continue even after reporting a problem,
> but I am not sure how to do that either.

The other approach would be to check the counts upon reaching a
terminating section. As mentioned below, in English, all quotes are
implicitly closed by the end of a paragraph, so any nonzero counts at
the end of a paragraph are OK. But when you encounter a closing quote,
you can make sure that the last opening quote is the same type of
quote. If you store the opening quote type in a stack, pop whenever you
encounter a closing quote while confirming a match, report an error
upon trying to pop an empty stack or encountering a mismatched quote,
and clear the stack upon reaching a paragraph end, that would provide
something useful for English.
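
The stack idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: `check_quotes` and `PAIRS` are hypothetical names, the input is assumed to be already split into paragraphs, and a U+2019 followed by a letter is skipped as an apostrophe. On a mismatch the stack is left untouched so that checking can continue.

```python
PAIRS = {"\u201c": "\u201d", "\u2018": "\u2019"}  # opening quote -> closing quote
CLOSERS = set(PAIRS.values())


def check_quotes(paragraphs):
    """Return a list of (paragraph_index, char_index, message) problems."""
    errors = []
    for p, para in enumerate(paragraphs):
        stack = []  # closing quotes we still expect, innermost last
        for i, ch in enumerate(para):
            if ch in PAIRS:
                stack.append(PAIRS[ch])
            elif ch in CLOSERS:
                # Heuristic: U+2019 directly before a letter is an apostrophe.
                if ch == "\u2019" and i + 1 < len(para) and para[i + 1].isalpha():
                    continue
                if not stack:
                    errors.append((p, i, f"closing {ch} with nothing open"))
                elif stack[-1] != ch:
                    errors.append((p, i, f"expected {stack[-1]} but found {ch}"))
                else:
                    stack.pop()
        # Paragraph end: whatever is still open counts as implicitly closed,
        # because the stack is rebuilt from scratch for the next paragraph.
    return errors
```

With this approach a triple-nested quotation passes, and a multi-paragraph speech whose paragraphs each open with a double quote passes as well.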

>> Another quirk that occurs to me is that in English typesetting, if one
>> person speaks multiple paragraphs (for example, the Sermon on the Mount)
>> then each paragraph gets an opening double-quote, but no closing
>> double-quote. That's going to play havoc with this kind of
>> quote-checking tool, too.
>
> Yes, we don’t do this in Czech, but it is typographically
> possible to just use paragraph indentation instead
> of quoting, and of course we don’t have anything like
> indentation in the pure XML. I have just added quotes in
> the appropriate places and plan to send the patch to the
> Czech Biblical Society (after David reviews my fixes in
> https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2)
> with some other clear bugs I have found.


See above.

Unfortunately, it sounds like English speakers would want the script to 
be aware of different rules per-language, which definitely complicates 
things. But that would increase the utility in automatically identifying 
likely transcription errors.
___
sword-devel mailing list: sword-devel@crosswire.org
http://crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page

Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-19 Thread Matěj Cepl

On Tue Dec 19, 2023 at 1:30 AM CET, Timothy Allen wrote:
> As a data point, when I was writing scripts for manipulating and 
> updating the BSB module, I found the `xml.etree.ElementTree` module in 
> the Python standard library to be many times faster than the SAX API. 
> The SAX API is perhaps a bit more convenient, because you can just 
> subscribe to whatever events are meaningful for whatever processing you 
> want to do, but ElementTree is just so much faster I found it was worth it.

I have heard good things about XMLPullParser, but I have never
tried to use it in anger.

I am not sure how plain ElementTree (and I suppose you mean
its `findall()` method) could help me here. My main focus is on
the SAX `characters()` method, and here ElementTree with its
`.text` and `.tail` attributes doesn’t help much, although if
there were a `findalltext()` method, it could get interesting.
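
For what it is worth, the standard library has something close to that wish: `Element.itertext()` yields the `.text` and `.tail` strings of an element and all of its descendants in document order. A minimal sketch follows; the sample fragment is made up, uses no namespace, and assumes verses are container elements, whereas real OSIS files use a namespace and often mark verses with milestones, so this would need adapting.

```python
import xml.etree.ElementTree as ET

# Hypothetical OSIS-like fragment, simplified for illustration.
SAMPLE = (
    '<div><verse osisID="Gen.1.1">In the <w>beginning</w> God created'
    '<note>a footnote</note> the heavens.</verse></div>'
)


def verse_texts(xml_string):
    """Map each verse's osisID to its full text, with markup flattened."""
    root = ET.fromstring(xml_string)
    return {
        verse.get("osisID"): "".join(verse.itertext())
        for verse in root.iter("verse")
    }
```

Note that `itertext()` also includes the text of nested notes, so a checker that wants to skip footnotes would still need its own tree walk.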

> LXML is probably faster again, but that's a third-party dependency, and 
> that adds enough hassle for people who aren't Python developers that I 
> drew the line there.

Certainly, I am a big fan of writing just in the confines of the
Python standard library.

> If you've already written things using the SAX API that work well for 
> you, there's probably no point rewriting them, but if you're writing 
> more tools in the future, you might want to give it a try!

I know ElementTree and it is very useful for simple searches
in the XML tree, but I am not sure how it would help with this
project or with my CzeCSP conversion.

Blessings,

Matěj

-- 
http://matej.ceplovi.cz/blog/, @mcepl@floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5  BC1D 7920 5802 880B C9D8
 
“Anything essential is invisible to the eyes”, the little prince
repeated, in order to remember.
“It’s the time you spent on your rose that makes your rose so
important.”
“It’s the time I spent on my rose …,” the little prince
repeated, in order to remember.
“People have forgotten this truth.” the fox said. “But you
mustn’t forget it.  You become responsible forever for what
you’ve tamed. You’re responsible for your rose…”
“I’m responsible for my rose…,” the little prince repeated, in
order to remember.
-- Antoine de Saint-Exupéry: The Little Prince


Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-19 Thread Matěj Cepl

On Tue Dec 19, 2023 at 2:17 AM CET, Timothy Allen wrote:
> I tried running it over my BSB module, and I hit problems fairly 
> quickly, some of which are more easily solved than others.
>
> 1. No support for language “en”
>
> This was easy enough to handle, there's a configuration variable near 
> the top of the file that lets you configure which quotes are used for 
> which languages.

A patch sent to my email would be welcome.

> 2. Apostrophes
>
> In English, the apostrophe used for possession (“the boy’s train”) and 
> omission (“don’t let’s start") is traditionally set with the same 
> character used as the closing single quote, so in any non-trivial 
> document there will almost certainly be more "closing single quotes" 
> than opening single quotes, it's not worth reporting on.

Yes, I am aware of it, and I feel very blessed that I don’t
have this problem in Czech. I have no idea what to do with
this without proper syntactic analysis, which is out of the
question. Perhaps running `re.sub(r'’s\b', '@#s', whole_text)`
and then reversing it afterwards, but it seems like a recipe for disaster.

> 3. Nested quotations
>
> In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell 
> other people that she was Abraham’s sister. In the BSB (and NIV, and 
> ESV, and NASB) this results in a triple-nested quotation. In English 
> typesetting conventions the outermost quotation gets double-quotes, the 
> second level gets single-quotes, and the third level gets double quotes 
> again. This causes the script to report an error:
>
> I couldn't immediately think of a way to get around this.

Me neither. We should probably make an effort at error recovery, so
that the script would continue even after reporting a problem,
but I am not sure how to do that either.

> Another quirk that occurs to me is that in English typesetting, if one 
> person speaks multiple paragraphs (for example, the Sermon on the Mount) 
> then each paragraph gets an opening double-quote, but no closing 
> double-quote. That's going to play havoc with this kind of 
> quote-checking tool, too.

Yes, we don’t do this in Czech, but it is typographically
possible to just use paragraph indentation instead
of quoting, and of course we don’t have anything like
indentation in the pure XML. I have just added quotes in
the appropriate places and plan to send the patch to the
Czech Biblical Society (after David reviews my fixes in
https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2)
with some other clear bugs I have found.

> Perhaps this kind of tool just isn't suited to checking English text... 
> but I'm sure there's other languages with more sensible conventions that 
> it could help with. Good luck with it!

With https://gitlab.com/crosswire-bible-society/CzeCEP/-/merge_requests/4/diffs
I have managed to make CzeCEP behave. Now I will try other Czech modules.

Blessings,

Matěj

-- 
http://matej.ceplovi.cz/blog/, @mcepl@floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5  BC1D 7920 5802 880B C9D8

Power tends to corrupt and absolute power corrupts
absolutely. Great men are almost always bad men, […]
  -- Lord Acton (including the more important part of the often
 misquoted statement)

Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-18 Thread Timothy Allen


On 19/12/23 00:06, Matěj Cepl wrote:

> I have decided not to rely on the very kind help of David
> with his Windows tools, and I have written a (hopefully)
> completely platform-neutral pure Python 3 script for checking
> pairwise characters. So far, it has been used only for fixing
> https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2 and
> I am quite sure it is pretty buggy, but it could prove useful
> to somebody.


Thank you for doing this work! This seems like it could be a useful tool 
for validating texts of all kinds.


I tried running it over my BSB module, and I hit problems fairly 
quickly, some of which are more easily solved than others.


1. No support for language “en”

This was easy enough to handle: there's a configuration variable near 
the top of the file that lets you configure which quotes are used for 
which languages.
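
A per-language table of that kind might look something like the sketch below. The variable name `QUOTE_PAIRS`, its structure and the helper `pairs_for` are guesses for illustration only; the real variable in the script may be named and shaped differently.

```python
# Hypothetical layout: language code -> list of (opening, closing) pairs,
# outermost level first.
QUOTE_PAIRS = {
    "cs": [("\u201e", "\u201c"), ("\u201a", "\u2018")],  # „…“ and ‚…‘
    "en": [("\u201c", "\u201d"), ("\u2018", "\u2019")],  # “…” and ‘…’
}


def pairs_for(lang):
    """Return the configured quote pairs, or fail loudly for an unknown language."""
    try:
        return QUOTE_PAIRS[lang]
    except KeyError:
        raise ValueError(f"no quote pairs configured for language {lang!r}") from None
```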


2. Apostrophes

In English, the apostrophe used for possession (“the boy’s train”) and 
omission (“don’t let’s start”) is traditionally set with the same 
character used as the closing single quote, so in any non-trivial 
document there will almost certainly be more "closing single quotes" 
than opening single quotes; it's not worth reporting on.


I got around this by just deleting single quotes from the configuration.

3. Nested quotations

In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell 
other people that she was Abraham’s sister. In the BSB (and NIV, and 
ESV, and NASB) this results in a triple-nested quotation. In English 
typesetting conventions the outermost quotation gets double-quotes, the 
second level gets single-quotes, and the third level gets double quotes 
again. This causes the script to report an error:



Balance for  character “ is over one in Gen.20.13


I couldn't immediately think of a way to get around this.

Another quirk that occurs to me is that in English typesetting, if one 
person speaks multiple paragraphs (for example, the Sermon on the Mount) 
then each paragraph gets an opening double-quote, but no closing 
double-quote. That's going to play havoc with this kind of 
quote-checking tool, too.


Perhaps this kind of tool just isn't suited to checking English text... 
but I'm sure there's other languages with more sensible conventions that 
it could help with. Good luck with it!



Timothy.

Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-18 Thread Timothy Allen


On 19/12/23 01:45, Matěj Cepl wrote:

> 2. I use the SAX API (xml.sax from the standard library) and it seems
>    to me better suited to Bible processing than the
>    traditional DOM (or LXML) interface. It nicely hides away all
>    the hard work going on in the background and lets me work only on
>    what’s relevant to my task.


As a data point, when I was writing scripts for manipulating and 
updating the BSB module, I found the `xml.etree.ElementTree` module in 
the Python standard library to be many times faster than the SAX API. 
The SAX API is perhaps a bit more convenient, because you can just 
subscribe to whatever events are meaningful for whatever processing you 
want to do, but ElementTree is so much faster that I found it was worth it.


LXML is probably faster again, but that's a third-party dependency, and 
that adds enough hassle for people who aren't Python developers that I 
drew the line there.


If you've already written things using the SAX API that work well for 
you, there's probably no point rewriting them, but if you're writing 
more tools in the future, you might want to give it a try!



Timothy

Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-18 Thread Kristof Szabo

Ok, all good then, we are covered, this is a different use case.




Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-18 Thread Matěj Cepl

On Mon Dec 18, 2023 at 2:38 PM CET, Kristof Szabo wrote:
> I wrote some time back https://github.com/krisek/sword-test, with quite a
> few test cases, which, I think, covers your use case as well.

A couple of differences at first glance:

1. Functionally, I prefer my script, which stops when the first
   unpaired character is found, thus allowing the problem to be fixed.
2. I use the SAX API (xml.sax from the standard library) and it seems
   to me better suited to Bible processing than the
   traditional DOM (or LXML) interface. It nicely hides away all
   the hard work going on in the background and lets me work only on
   what’s relevant to my task. See
   https://gitlab.com/crosswire-bible-society/CzeCSP/-/blob/master/CEPtoOSIS.py
   for an example of much more complicated processing (and also,
   it is something like ten times faster than processing
   with Java and Saxon/XSLT).
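
To illustrate the SAX style described in point 2, here is a minimal, self-contained handler that looks only at character data and ignores all markup. It is a sketch, not the code from the repositories linked above; the class name `QuoteCounter`, the helper `count_chars` and the default set of watched characters are assumptions.

```python
from collections import Counter
import xml.sax


class QuoteCounter(xml.sax.ContentHandler):
    """Count selected characters in all text nodes, ignoring markup."""

    def __init__(self, watched):
        super().__init__()
        self.watched = set(watched)
        self.counts = Counter()

    def characters(self, content):
        # The parser may deliver one text node in several chunks;
        # counting is additive, so that does not matter here.
        self.counts.update(ch for ch in content if ch in self.watched)


def count_chars(xml_bytes, watched="\u201e\u201c"):
    """Parse the XML and return a Counter of the watched characters."""
    handler = QuoteCounter(watched)
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts
```

Only `characters()` is overridden; element start and end events fall through to the default no-op handlers, which is what keeps the task-specific code small.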

> > Temporarily the script is in its own repo
> > (https://gitlab.com/mcepl/bible-freq-counter) and attached to
> > this message, but I would like to submit it to sword-utils. How
> > to do it?

Just an update … I have moved the script to
https://git.crosswire.org/mcepl/bible-freq-counter.

Best,

Matěj

-- 
http://matej.ceplovi.cz/blog/, @mcepl@floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5  BC1D 7920 5802 880B C9D8

Nemo plus iuris ad alium transfere potest quam ipse habet.

Re: [sword-devel] Python script for checking pairwise characters (PROFF-OF-CONCEPT)

2023-12-18 Thread Kristof Szabo

Hi Matěj,

I wrote some time back https://github.com/krisek/sword-test, with quite a
few test cases, which, I think, covers your use case as well.

I was in touch with Dom on this at the time, but somehow the discussion
about how to include these in the module build pipeline stalled.

If you think it is useful I can check your code in detail, and we can see
how we can include it in sword-test.

I think the purpose of sword-utils is something else, and I would be
surprised if python code was incorporated there. (If that was the case, I'd
give a go with sword-test too :))

Kind regards,
Kristof

On Mon, Dec 18, 2023 at 2:07 PM Matěj Cepl wrote:

> Hello,
>
> I have decided not to rely on the very kind help of David
> with his Windows tools, and I have written a (hopefully)
> completely platform-neutral pure Python 3 script for checking
> pairwise characters. So far, it has been used only for fixing
> https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2 and
> I am quite sure it is pretty buggy, but it could prove useful
> to somebody.
>
> Temporarily the script is in its own repo
> (https://gitlab.com/mcepl/bible-freq-counter) and attached to
> this message, but I would like to submit it to sword-utils. How
> to do it?
>
> Blessings,
>
> Matěj
> --
> http://matej.ceplovi.cz/blog/, @mcepl@floss.social
> GPG Finger: 3C76 A027 CA45 AD70 98B5  BC1D 7920 5802 880B C9D8
>
> Afraid to die alone?
> Become a bus driver.
>   -- alleged easter egg in notepad++