I've been working long hours and emailing in my break time. David has the basics of converting to VPL.
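For what it's worth, the spreadsheet method I outline below can also be scripted if the columns get unwieldy. This is only a rough Python sketch of the same align-then-concatenate idea. The function names `bcv_refs` and `build_vpl` are made up for illustration, and the "Book C:V text" line layout is my assumption about the VPL shape you are aiming for. It does not handle section titles or verse ranges.

```python
def bcv_refs(book, verses_per_chapter):
    """Generate 'Book C:V' references in order.

    verses_per_chapter is a list of verse counts, one entry per chapter.
    """
    return [
        f"{book} {chapter}:{verse}"
        for chapter, count in enumerate(verses_per_chapter, start=1)
        for verse in range(1, count + 1)
    ]


def build_vpl(refs, verses):
    """Concatenate each reference with its verse text, one verse per line.

    Refuses to run when the two columns differ in length, which is the
    scripted equivalent of checking that the spreadsheet columns align.
    """
    if len(refs) != len(verses):
        raise ValueError(
            f"{len(refs)} references but {len(verses)} verse texts; "
            "fix the alignment before merging"
        )
    return [f"{ref} {text.strip()}" for ref, text in zip(refs, verses)]


if __name__ == "__main__":
    # Hypothetical data: a one-chapter book assumed to have 14 verses,
    # while the extracted text turned out to have 15 lines.
    refs = bcv_refs("3John", [14])
    verses = [f"verse text {n}" for n in range(1, 16)]
    try:
        build_vpl(refs, verses)
    except ValueError as err:
        print(err)
```

The length check is deliberately strict: a mismatch like the 3 John case should be fixed by hand in the reference list or the text, not papered over by the script.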
I would then put the entire work into one column of a spreadsheet. Then, in other columns, insert a list of book/chapter/verse (BCV) references in order. The BCV and verse-text columns should align; this can be verified, and adjusted where things don't match perfectly, for example where 3 John may have 15 verses instead of 14. Once the columns align, you can merge them into another column via concatenation operations (&). This last column becomes your output. The output needs to take into account that section titles and section ranges belong in front of the verse marker. That is a somewhat more complex search and replace, but it can be done successfully.

On Wed, May 15, 2019 at 11:12 AM David Haslam <dfh...@protonmail.com> wrote:

> The attachment contains a counted list of Myanmar words containing a font
> conversion error.
> *NB. We need to match these words with what they are in the legacy font.*
>
> This issue should be discussed with the current maintainer of the SIL
> *TECkit* converter, whoever that may be.
>
> It may be worthwhile asking our friends at the SIL *Writing Systems
> Technology* team. See
> https://scripts.sil.org/default
>
> *Aside: My friend Martin Hosken of SIL knew the late Keith Stribley - the
> former webmaster of ThanLwinSoft.*
>
> Best regards,
>
> David
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, May 15, 2019 4:41 PM, David Haslam <dfh...@protonmail.com>
> wrote:
>
> *Observations: (continued)*
>
> 5. The string "*Kd;*" also looks anomalous. It's found only once in
> ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>
> 6. It's evident from the PDF file that the text is paragraphed with
> indented first lines. See
>
> https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>
> My hunch is that these leading paragraph indents may have been coded
> within contents.xml as the self-closing element *<text:tab/>*. There are
> 372 matches to this.
>
> So not only do we need to provide chapter and verse tags (plus section
> headings & parallel passage titles, etc), we also need to reconstruct all
> the paragraph tags.
>
> *NB. All structural XML indents were removed by the filter "Remove blanks
> at SOL" in the file **contents.pp.txt** that was output by my simple
> TextPipe filter. So that's quite a different matter.*
>
> Best regards,
>
> David
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, May 15, 2019 2:22 PM, David Haslam <dfh...@protonmail.com>
> wrote:
>
> *Observations: (continued)*
>
> 4. In addition to the reported instances of the anomalous 3 characters
> (*È,Ø,ò*) found after the font conversion,
> there are 6 instances of the string "*m;*" that are also probably due to
> bugs in the converter.
>
> Best regards,
>
> David
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, May 15, 2019 12:41 PM, David Haslam <dfh...@protonmail.com>
> wrote:
>
> Yep - sure - later I can do that.
>
> David
>
> Sent from ProtonMail Mobile
>
> On Wed, May 15, 2019 at 11:26, Cyrille <lafricai...@gmail.com> wrote:
>
> David, I have no Box account, and I don't want to create one. Can you
> upload it to https://framadrop.org/ instead? It's totally free and secure
> (and private).
> Thank you.
>
> On 15/05/2019 11:46, David Haslam wrote:
>
> Interim progress report.
>
> I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the
> contents to Mat_utf8-odt
>
> I opened the .odt file using 7-Zip from the Windows Explorer context menu,
> and extracted the file contents.xml
>
> I used Notepad++ plug-in XMLTools to pretty print the XML file and saved it
> as contents.pp.xml
> This is simply a layout change that's easier to read.
>
> I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text
> was (mostly) Myanmar Unicode.
>
> I used a TextPipe filter to remove all XML tags, blanks from SOL & EOL and
> all blank lines.
> The output file is now contents.pp.txt
>
> This is now something that's readable content in Myanmar Unicode, with some
> English text such as "The Gospel according Matthew" near the start.
>
> The file is best viewed using BabelPad with the option Display Colours |
> Colour Code by Script.
> This shows Myanmar characters in light green, and non-Myanmar characters in
> other colours.
>
> Observations:
> 1. The font conversion to Unicode left a few scattered characters
> unconverted. :(
>
> 0000C8  È  18  LATIN CAPITAL LETTER E WITH GRAVE
> 0000D8  Ø  20  LATIN CAPITAL LETTER O WITH STROKE
> 0000F2  ò   3  LATIN SMALL LETTER O WITH GRAVE
>
> The complete character frequency analysis is attached.
>
> 2. A few verse numbers (?) are still present here and there.
> 3. The content contains section headings and parallel passage headings as
> well as verse text.
>
> I have just uploaded the file contents.pp.zip to a new folder in my Box
> account and added Cyrille & Michael as viewers.
>
> Best regards,
>
> David
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, May 13, 2019 9:19 AM, Cyrille <lafricai...@gmail.com> wrote:
>
> Hello,
> I recently received a modern Myanmar translation of the NT, Psalms and
> Proverbs, with permission to create a new module.
> But the problems are many... First, getting the text.
> I tested different ways, but it was done with PageMaker!
> I can get the text, but the problem is that I don't have the verse numbers,
> because they are in a parallel column alongside, and when I copy it I get
> only the biblical text.
> I also have a PDF, but when I convert it to text (with pdftotext) the
> columns are mixed.
> Can someone help me with any ideas?
> The next problem is Unicode... The text is not typed in Unicode but uses
> a special font.
> I can send everything you need or push it to git.crosswire.
>
> Thanks for help.
>
> sword-devel mailing list: sword-devel@crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page