Costas,
A few comments...

Costas Stergiou wrote:
Hi David/Troy, looking at the texts, I think there is some work to be done:
- remove any combining diacriticals & process everything as precomposed.
I think this is backwards. From my limited understanding, and from reading recent posts on sword-devel from people with much more knowledge than I have, I think the text should be stored with no precomposed characters. If the renderer needs to send precomposed characters to the display control, then it can precompose them (sword can do this with an ICU filter, I think).
- remove common mistakes found by text processing (e.g. wrong letters)
- fix missing spaces between some words
- compare words with other accented texts to find other errors, etc.
Right now I am working on an accented Greek text (a Byzantine one) which looks very, very good. It is supposed to be the official Greek text used by the Eastern Orthodox Church. Actually, it is very close to the Byzantine text (and at times to the TR). I also have a printed version of it, and it does seem very good. I got it from http://kainh.homestead.com.
At the same time, I have been working on some other accented Greek texts as well. What I think is that, by having all those accented texts, maybe I could write a util that takes almost any unaccented Greek text, looks at it verse by verse, and adds diacriticals by using the various accented versions I have. I am not sure that this is feasible, but when I look at the differences between the texts, I realize they are very small and most (if not all) of them can be found programmatically.
About the WH you sent me: I would like to test it through all the various scripts I have and make any corrections, taking my time.
One thing is important here: all the above can only happen on the texts WITHOUT the Strong's & morph tags. So, I suppose, we need to find a generic way of adding these later. I think it is not difficult, but since I don't know the specifics I cannot tell.
You can iterate thru the text without Strong's with a very simple routine:

    SWMgr swordLibrary;
    // Turn off the Strong's filter so the rendered text carries no numbers
    swordLibrary.setGlobalOption("Strong's Numbers", "Off");
    SWModule *whac = swordLibrary.Modules["WHAC"];
    // Walk the module from its first entry until the key runs past the end
    for ((*whac) = TOP; !whac->Error(); (*whac)++) {
        cout << whac->RenderText();
    }

I would suggest using your scripts to find errors with the above code, then correcting the error in the module with strongs/morph. You can export the module with mod2osis or mod2imp, whichever is easier for you to work with. Then you can import it back with osis2mod or imp2vs.
Thanks for all the work you guys are thinking about and doing! I'm excited to see these resources excel!
-Troy.
What I can do is the processing of the Greek texts (which comes naturally to me). I will be happy to collaborate on this with anyone else interested.
David: for now, I think there is nothing I would need from you, I still have to progress myself. You also mentioned polycarp66. Who is he? Maybe he could help out also?
It would be good to post this to the sword list in case others are working on similar issues. Troy, you can do this if you think it would be beneficial.
With love in our Lord Jesus Christ, Costas
P.S. (maybe Chris should be reading this also since I think he is the module expert? not sure...)
----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, April 03, 2004 8:08 PM
Subject: Re: Westcott-Hort

Costas,
I wrote software that extracted the Strong's numbers from the unaccented W.H. from byztxt.com and inserted them in the accented W.H. from CCEL. The software examines both texts one verse at a time, creates a word list for each, and attempts to match the words, ignoring case and accents and paying attention to word order, to find where each Strong's number should be placed.

Reinserting the Strong's numbers would probably be possible, but would require some modification of the software, since it looks at the HTML-encoded Unicode (&#XXX) in the pages from CCEL. The HTML files that I have may not be the same as what is on CCEL: polycarp66 fixed some errors, missing text, etc. He sent all of his corrections to CCEL, but I do not know if they have replaced their files with his corrected ones. It has been a while since I worked on the W.H. I will try to locate all of the current files. I do not know Greek, so all I can do is fix character encodings, remove the extra spaces, etc.

It may be best to correct the HTML files (for any corrections that must be done by hand) and then reprocess everything. If needed, the HTML files could simply be converted to UTF-8 and the Strong's numbers left out (if you do not need the Strong's numbers). The text could be reprocessed with Strong's numbers inserted for Troy.

There were also some differences in versification between the texts.
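
The verse-by-verse matching described above could be sketched roughly as follows. This is not David's program, only a minimal illustration under several assumptions: both word lists are already Unicode code points in decomposed form (so accents are separate combining marks), the tags for each word are kept as one opaque string, and a small forward lookahead stands in for "paying attention to the order of the words". The names TaggedWord, AlignedWord, matchKey and transferTags are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One word from the tagged, unaccented text plus the tags that follow it.
struct TaggedWord {
    std::u32string word;
    std::string tags;  // e.g. "3056 {N-NSM}", kept as an opaque string
};

// One word of the accented text with the tags copied onto it (may stay empty).
struct AlignedWord {
    std::u32string word;
    std::string tags;
};

// Build a comparison key: drop combining marks, fold the basic Greek
// capitals to lowercase, and treat final sigma like medial sigma.
// Assumption: input is decomposed; precomposed letters are not handled.
std::u32string matchKey(const std::u32string &word) {
    std::u32string key;
    for (char32_t c : word) {
        if (c >= 0x0300 && c <= 0x036F) continue;   // combining diacritical marks
        if (c >= 0x0391 && c <= 0x03A9) c += 0x20;  // U+0391..U+03A9 -> U+03B1..U+03C9
        if (c == 0x03C2) c = 0x03C3;                // final sigma -> sigma
        key.push_back(c);
    }
    return key;
}

// Copy tags from the tagged verse onto the accented verse. The search only
// moves forward through the tagged list, so word order is respected; a word
// with no match within `lookahead` positions is left untagged.
std::vector<AlignedWord> transferTags(const std::vector<TaggedWord> &tagged,
                                      const std::vector<std::u32string> &accented,
                                      std::size_t lookahead = 3) {
    std::vector<AlignedWord> out;
    std::size_t next = 0;  // first tagged word not yet consumed
    for (const std::u32string &w : accented) {
        AlignedWord aligned{w, ""};
        const std::u32string key = matchKey(w);
        for (std::size_t i = next; i < tagged.size() && i < next + lookahead; ++i) {
            if (matchKey(tagged[i].word) == key) {
                aligned.tags = tagged[i].tags;
                next = i + 1;
                break;
            }
        }
        out.push_back(aligned);
    }
    return out;
}
```

A greedy forward search like this can skip tagged words when it matches too far ahead, so untagged words in the output are worth reviewing by hand, especially where the two texts differ in versification.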
I have attached a zip file containing:

wc_a_.txt  The processed W.H. from byztxt.com (one verse per line). It is encoded for the OLBGreek font.

This verse was removed because it appears to be completely enclosed in variant markers, does not exist in the CCEL W.H., and I did not know what to do with it.
12:47 | | [eipen 2036 5627 {V-2AAI-3S} de 1161 {CONJ} tiv 5100 {X-NSM} autw 846 {P-DSM} idou 2400 5628 {V-2AAM-2S} h 3588 {T-NSF} mhthr 3384 {N-NSF} sou 4675 {P-2GS} kai 2532 {CONJ} oi 3588 {T-NPM} adelfoi 80 {N-NPM} sou 4675 {P-2GS} exw 1854 {ADV} esthkasin 2476 5758 {V-RAI-3P} zhtountev 2212 5723 {V-PAP-NPM} soi 4671 {P-2DS} lalhsai] 2980 5658 {V-AAN} |

wh_b.txt  The text that is actually stripped from the CCEL html files, with versification modified to match the versification of the text from byztxt.com.

I am not sure what needs to be done, so you will have to tell me what you need me to do.
David
_______________________________________________
sword-devel mailing list
[EMAIL PROTECTED]
http://www.crosswire.org/mailman/listinfo/sword-devel
