Costas,
        A few comments...

Costas Stergiou wrote:
Hi David/Troy,
looking at the texts, I think there is some work to be done:
- remove any combining diacriticals & process everything as precomposed.

I think this is backwards. From my limited understanding, and from recent posts on sword-devel by people who know much more than I do, the text should be stored with no precomposed characters. If the renderer needs to send precomposed characters to the display control, it can precompose them at that point (I think sword can do this with an ICU filter).



- remove common mistakes found by text processing (e.g. wrong letters)
- fix missing spaces between some words
- compare words with other accented texts to find other errors, etc.

Right now I am working on an accented greek text (a byzantine one) which
looks very very good. It is supposed to be the official greek text used by
the eastern orthodox church. Actually, it is very close to the byzantine (at
some times with the TR). I also have a printed version of it, and it does
seem very good. I got it from http://kainh.homestead.com.

At the same time, I have been working on some other accented greek texts
also.
What I think is that by having all those accented texts, maybe I could write
a util that takes almost any unaccented greek text, looking at it verse by
verse and adding diacriticals by using the various accented versions I have.
I am not sure that this is feasible but when I look at the differences
between the texts, I realize they are very small and most (if not all of
them) can be found programmatically.
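
A rough sketch of how such a verse-by-verse util might look, not anyone's actual code. It assumes each verse is already held as decomposed (NFD) Unicode code points, that words are separated by single spaces, and that only the basic Greek capitals need case folding. The names accentlessKey, splitWords and addAccents are made up for the illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: comparison key for a word stored as decomposed code points.
// Drops combining marks (U+0300..U+036F) and folds basic Greek capitals.
std::u32string accentlessKey(const std::u32string &word) {
    std::u32string key;
    for (char32_t c : word) {
        if (c >= 0x0300 && c <= 0x036F) continue;   // combining diacritical, skip it
        if (c >= 0x0391 && c <= 0x03A9) c += 0x20;  // capital alpha..omega -> lowercase
        if (c == 0x03C2) c = 0x03C3;                // final sigma -> ordinary sigma
        key.push_back(c);
    }
    return key;
}

// Split a verse on spaces (assumption: no punctuation handling is needed).
std::vector<std::u32string> splitWords(const std::u32string &verse) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t c : verse) {
        if (c == U' ') {
            if (!current.empty()) words.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Replace each word of the unaccented verse with the first accented witness
// that has the same key; words with no witness are left unchanged.
std::u32string addAccents(const std::u32string &plainVerse,
                          const std::vector<std::u32string> &accentedVerses) {
    std::map<std::u32string, std::u32string> known;
    for (const auto &verse : accentedVerses)
        for (const auto &word : splitWords(verse))
            known.emplace(accentlessKey(word), word);  // first witness wins

    std::u32string out;
    for (const auto &word : splitWords(plainVerse)) {
        auto hit = known.find(accentlessKey(word));
        if (!out.empty()) out.push_back(U' ');
        out += (hit != known.end()) ? hit->second : word;
    }
    return out;
}
```

Calling addAccents with one unaccented verse and the same verse from each accented text returns the verse with accents filled in wherever a witness exists. A word that is accented differently in two places of the same verse would need positional matching, which this sketch does not attempt.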

About the WH you sent me: I would like to test it through all the various
scripts I have and make any corrections, taking my time.

One thing is important here:
All the above can only happen on the texts WITHOUT the strong & morph tags.
So, I suppose, we need to find a generic way of adding these later. I think
it is not difficult, but since I don't know the specifics I cannot tell.

You can iterate thru the text without Strong's with a very simple routine:


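// Needs <iostream>, <swmgr.h> and <swmodule.h>, plus the sword and std namespaces.
SWMgr swordLibrary;
// Hide Strong's numbers in the rendered output.
swordLibrary.setGlobalOption("Strong's Numbers", "Off");
SWModule *whac = swordLibrary.Modules["WHAC"];
// Walk every entry from the top of the module until the key runs off the end.
for ((*whac) = TOP; !whac->Error(); (*whac)++) {
        cout << whac->RenderText();
}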


I would suggest using your scripts to find errors with the above code, then correcting the error in the module with strongs/morph. You can export the module with mod2osis or mod2imp-- whichever is easier for you to work with. Then you can import it back with osis2mod or imp2vs.



Thanks for all the work you guys are thinking about and doing! I'm excited to see these resources excel!


-Troy.



What I can do, is the processing of the Greek texts (which is natural to
me). I will be happy to collaborate on this with anyone else interested.

David: for now, I think there is nothing I would need from you, I still have
to progress myself.  You also mentioned polycarp66. Who is he? Maybe he
could help out also?

It would be good to post this to the sword list in case others are working
on similar issues. Troy, you can do this if you think it would be beneficial.

With love in our Lord Jesus Christ,
Costas

P.S. (maybe Chris should be reading this also since I think he is the module
expert? not sure...)




----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, April 03, 2004 8:08 PM
Subject: Re: Westcott-Hort



Costas,

I wrote software that extracted the strong's numbers from the unaccented W.H.
from byztxt.com and inserted them in the accented W.H. from CCEL.
The software examines both texts 1 verse at a time, creates a word list for
each, attempts to compare the words ignoring case and accents, and pays
attention to the order of the words while attempting to find a match for
where a strong's number should be placed.
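
The matching step described above could be sketched roughly as follows. This is not the actual software, only an illustration under assumptions: both word lists have already been reduced to the same comparison key (lowercase, no accents, same transliteration), and a match is only searched for a few words ahead so that word order is respected. TaggedWord, alignTags and the window size are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One word of the tagged (unaccented) text.
struct TaggedWord {
    std::string key;   // word already normalised for comparison
    std::string tags;  // everything that follows it, e.g. "1161 {CONJ}"
};

// For each word of the accented text, return the tags of the matching tagged
// word, or an empty string when no match is found within the window.
std::vector<std::string> alignTags(const std::vector<TaggedWord> &tagged,
                                   const std::vector<std::string> &accentedKeys,
                                   std::size_t window = 3) {
    std::vector<std::string> result(accentedKeys.size());
    std::size_t next = 0;  // first tagged word not yet consumed
    for (std::size_t i = 0; i < accentedKeys.size(); ++i) {
        // Only look forward from the last match, and only a few words ahead.
        for (std::size_t j = next; j < tagged.size() && j < next + window; ++j) {
            if (tagged[j].key == accentedKeys[i]) {
                result[i] = tagged[j].tags;
                next = j + 1;
                break;
            }
        }
    }
    return result;
}
```

Words that come back with empty tags are the places where the two texts differ and a person would have to look.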
Reinserting the strong's numbers would probably be possible, but would require
some modification of the software, since it looks at the html encoded unicode
(&#XXX) in the pages from CCEL. The html files that I have may not be the
same as what is on CCEL; polycarp66 fixed some errors, missing text etc. He
sent all of his corrections to CCEL, but I do not know if they have replaced
their files with his corrected ones. It has been a while since I worked on the
W.H. I will try to locate all of the current files. I do not know greek, so
all I can do is fix character encodings, remove the extra spaces etc.
It may be best to correct the html files (for any corrections that must be
done by hand), and then reprocess everything.
If needed the html files could simply be converted to utf8, and the strong's
numbers left out. (If you do not need the strong's numbers.)
The text could be reprocessed with strong's numbers inserted for Troy.
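
For the conversion of the html encoded unicode to utf8, a minimal sketch might look like this. It assumes the pages only use numeric references of the form &#NNN; (decimal) or &#xHHH; (hex); named entities such as &amp; are left untouched. The function names are made up.

```cpp
#include <cstddef>
#include <string>

// Append one Unicode code point to a UTF-8 encoded string.
void appendUtf8(std::string &out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Value of one digit, or -1 if the character is not a digit in this base.
int digitValue(char c, bool hex) {
    if (c >= '0' && c <= '9') return c - '0';
    if (hex && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (hex && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replace numeric character references with UTF-8; copy everything else as is.
std::string decodeNumericRefs(const std::string &html) {
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html.compare(i, 2, "&#") == 0) {
            std::size_t p = i + 2;
            bool hex = p < html.size() && (html[p] == 'x' || html[p] == 'X');
            if (hex) ++p;
            std::size_t start = p;
            unsigned long cp = 0;
            while (p < html.size() && cp <= 0x10FFFF) {
                int d = digitValue(html[p], hex);
                if (d < 0) break;
                cp = cp * (hex ? 16 : 10) + static_cast<unsigned long>(d);
                ++p;
            }
            bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (p > start && p < html.size() && html[p] == ';' &&
                cp <= 0x10FFFF && !surrogate) {
                appendUtf8(out, cp);
                i = p + 1;
                continue;
            }
        }
        out += html[i++];  // not a valid reference, keep the character
    }
    return out;
}
```

Malformed references are copied through unchanged, so they stay visible for correction by hand.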

There were also some differences in versification between the texts.

I have attached a zip file containing:
wc_a_.txt
The processed W.H. from byztxt.com (verse per line) it is encoded for the
OLBGreek font.

This verse was removed, because it appears to be completely enclosed in
variant markers, does not exist in the CCEL W.H., and I did not know what
to do with it.

12:47 | | [eipen 2036 5627 {V-2AAI-3S} de 1161 {CONJ} tiv 5100 {X-NSM} autw
846 {P-DSM} idou 2400 5628 {V-2AAM-2S} h 3588 {T-NSF} mhthr 3384 {N-NSF} sou
4675 {P-2GS} kai 2532 {CONJ} oi 3588 {T-NPM} adelfoi 80 {N-NPM} sou 4675 {P-2GS}
exw 1854 {ADV} esthkasin 2476 5758 {V-RAI-3P} zhtountev 2212 5723 {V-PAP-NPM}
soi 4671 {P-2DS} lalhsai] 2980 5658 {V-AAN} |
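
A line in this format could be read back into words and tags with something like the sketch below. The field layout is inferred only from the sample verse above: a reference, then words each followed by one or more numbers and a {morph} tag, with | and [ ] treated as variant markers. The struct and function names are hypothetical.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

struct TaggedToken {
    std::string word;                  // transliterated word, markers stripped
    std::vector<std::string> numbers;  // numeric tags in the order they appear
    std::string morph;                 // contents of {...}, empty if absent
};

struct TaggedVerse {
    std::string reference;             // first field of the line, e.g. "12:47"
    std::vector<TaggedToken> tokens;
};

static bool allDigits(const std::string &s) {
    if (s.empty()) return false;
    for (unsigned char c : s)
        if (!std::isdigit(c)) return false;
    return true;
}

TaggedVerse parseTaggedLine(const std::string &line) {
    TaggedVerse verse;
    std::istringstream in(line);
    std::string field;
    in >> verse.reference;
    while (in >> field) {
        if (field == "|") continue;  // variant marker on its own
        if (allDigits(field)) {
            // A number belongs to the most recent word.
            if (!verse.tokens.empty())
                verse.tokens.back().numbers.push_back(field);
        } else if (field.size() >= 2 && field.front() == '{' && field.back() == '}') {
            if (!verse.tokens.empty())
                verse.tokens.back().morph = field.substr(1, field.size() - 2);
        } else {
            // Anything else starts a new word; drop the bracket markers.
            TaggedToken token;
            for (char c : field)
                if (c != '[' && c != ']') token.word += c;
            if (!token.word.empty()) verse.tokens.push_back(token);
        }
    }
    return verse;
}
```

Keeping the numbers as strings avoids guessing what each one means; the sketch only records which word they follow.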

wh_b.txt
The text that is actually stripped from the CCEL html files with
versification modified to match the versification of the text from
byztxt.com.


I am not sure what needs to be done, so you will have to tell me what you
need me to do.

David





_______________________________________________ sword-devel mailing list [EMAIL PROTECTED] http://www.crosswire.org/mailman/listinfo/sword-devel
