Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Michael H
If the text doesn't come from an active team with people who read Burmese:
I am on another mailing list whose members work with people on minority
languages related to Burmese, but not Myanmarese. If that's the only option,
I can ask there whether anyone can verify that the Unicode conversion worked.


___
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Michael H
I don't read Burmese and I don't know anyone who does. I was suggesting you
contact whoever provided you the files and permission, and ask them if they
can verify the Unicode text reads correctly.

A native speaker will quickly recognize the text as either readable or
misspelled (letters out of place). It's better to check that now than to
spend a lot of time on the numbering.

Regarding the verse numbering: what I described (insert markers based on a
regex search, then apply markup based on a sequential list) would take me
about 2-4 hours to complete. At the end of that, the text will have proper
USFM tags for book titles, section titles, section references/ranges,
chapter numbers, and verse numbers. The placement of the verse numbers will
still need to be checked. This depends on the text not having issues on my
machine; I haven't tested whether all my tools handle Burmese text.
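
The final markup step could look roughly like the following Python sketch.
It assumes the verses have already been split and paired with references;
the function name and the tuple layout are made up for illustration. Only a
`\id` line per book and the chapter (`\c`) and verse (`\v`) markers are
emitted. Titles, section headings, and paragraph markers, which a complete
USFM file would also carry, are left out.

```python
def to_usfm(verses):
    """Render (book, chapter, verse, text) tuples, in canonical order, as USFM lines."""
    lines = []
    current_book = None
    current_chapter = None
    for book, chapter, verse, text in verses:
        if book != current_book:
            # A new book starts: emit its identification line.
            lines.append(f"\\id {book}")
            current_book, current_chapter = book, None
        if chapter != current_chapter:
            # A new chapter starts within the current book.
            lines.append(f"\\c {chapter}")
            current_chapter = chapter
        lines.append(f"\\v {verse} {text.strip()}")
    return "\n".join(lines) + "\n"
```

Because the markers are derived purely from the sequence, a single wrongly
split verse shifts every later number, which is why the placement still has
to be checked by hand.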

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Cyrille


On 14/05/2019 22:48, Michael H wrote:
> You should be able to configure a regex search to find the verse
> boundaries.
>
> Once you have verse boundaries, if you arrange the text one verse per
> line, it should be possible to assign each row a chapter and verse
> number from a reference list. That is, the 3341st verse in the New
> Testament is usually John 20:31 (I don't have that memorized, just an
> example.)

I have no idea how to do this :)

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Cyrille


On 14/05/2019 22:55, Cyrille wrote:
>
>
> On 14/05/2019 22:45, Michael H wrote:
>> Cyrille, did you start from the PDF or the PageMaker file?
> PageMaker
>> Either way, you should send a snippet to your source and validate the
>> words are still readable. As few as 30 words should be enough.
The converted text? If so, see the attached file.

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Michael H
You should be able to configure a regex search to find the verse
boundaries.

Once you have verse boundaries, if you arrange the text one verse per line,
it should be possible to assign each row a chapter and verse number from a
reference list. That is, the 3341st verse in the New Testament is usually
John 20:31 (I don't have that memorized, just an example.)
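
The row-to-reference assignment can be sketched in Python as below. This is
an illustration under assumptions, not a finished tool: the function names
are hypothetical, and the verse-count table is a deliberately tiny stand-in
(the first three chapters of Matthew, with KJV-style counts). A real run
would need the full table for whatever versification the Burmese text
follows.

```python
# Hypothetical, deliberately tiny table: (book code, verses per chapter).
VERSE_COUNTS = [
    ("MAT", [25, 23, 17]),
]


def reference_list(verse_counts):
    """Flatten the table into a sequential list of (book, chapter, verse)."""
    refs = []
    for book, chapters in verse_counts:
        for chapter_number, count in enumerate(chapters, start=1):
            for verse_number in range(1, count + 1):
                refs.append((book, chapter_number, verse_number))
    return refs


def assign_references(verse_lines, verse_counts):
    """Pair each verse-per-line row with its reference, by position only."""
    refs = reference_list(verse_counts)
    if len(verse_lines) != len(refs):
        # A count mismatch means some verse boundary was missed or invented.
        raise ValueError(
            f"expected {len(refs)} verse lines, got {len(verse_lines)}"
        )
    return list(zip(refs, verse_lines))
```

The length check is the cheap safeguard here: if the regex found too many or
too few boundaries, the totals disagree and the script stops instead of
silently shifting every later verse number.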


Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Michael H
Cyrille, did you start from the PDF or the PageMaker file? Either way, you
should send a snippet to your source and validate the words are still
readable. As few as 30 words should be enough.

On Tue, May 14, 2019 at 8:09 AM Cyrille  wrote:

> I'm sending my message again because the first one was too big.
>
> The conversion to UTF-8 is 99% solved!! I used an online converter:
>
> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
> or:
> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>
> See the result here:
> https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=
>
> Now the only problem is how to get the verse and chapter number...
>
>
> On 14/05/2019 13:53, Michael H wrote:
>
> Cyrille, (Peter),
>
> Maybe further discussion on this belongs in GitLab as issues.  Can I get
> added to this project?
>
> Here are the first few lines of Matthew copied from the PDF:
> --
> OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usr;f ûyy*k Kd¾v f  rf maw;O;D \b0rwS wf r;f
> usr;f ûyy*k Kd¾v f  rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d tmvaf
> z;O;D \om;jzp\f / (rmu k2;14)
> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf wG U Ny;D
>
> -
> And here are the first few lines of Matthew copied from the Pagemaker
> file:
> -
> Sifrmaw;OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usrf;�yyk*�dKvf  OD;\b0rSwfwrf;
> usrf;�yyk*�dKvf  OD;onf  *gavav;,e,frS *sL;vlrsKd;
> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf trIxrf;chJonf/ (vk
> 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD  ol\trnfrSm av0djzpf\/ olonf
> wdab;,tkdifteD;wGif  a,Zl;ocifESifhawGU  NyD;
>
>
> You can see that some letters have changed, and some others are in a
> different order.
>
> The letters that change are likely those code points that aren't
> compatible with Unicode, which PageMaker reassigned to ensure that the
> file is more widely viewable. Since a conversion is already planned,
> these won't matter as much, but the font embedded in the PDF is different
> from the font attached to the PageMaker file. If you do start from the
> PDF, you'll need to extract the font to get the code points.
>
> The problem is that the PDF export from PageMaker sorts the letters into
> the order they appear on the page. Burmese text has Indic-style
> ligatures, where vowels tend to jump over or under the previous letters,
> sometimes back two or three letters. If you study the snippets above from
> the beginning of Matthew, you can see a difference in order, and that
> some glyphs are modified.
>
> So, from the PDF the letters are out of order, but from PageMaker the
> letters are encoded at control code points. Fixing the control code
> points is easy and happens with the Unicode conversion. Fixing the letter
> order is not easy: you'll need a first-language speaker and plenty of
> time.
>
> The guidance I received on another group was to use either LO Draw or
> InDesign to export the text from PageMaker. I'll look into LO Draw again,
> but I don't have access to an older version of InDesign (the PageMaker
> import was removed in CS6).
>
>
> On Mon, May 13, 2019 at 10:40 AM Michael H  wrote:
>
>> I unzipped the PageMaker file, and when I open NT_Proverb/Pagemaker
>> (10.1 MB) with a hex editor, I can 'find' all of the book names and see
>> the text there.
>>
>> To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip and open it
>> with a zip archive program. The text is in the Pagemaker file at the top
>> level of the archive, but encoded with a lot of extraneous information.
>> (The English text "Matthew" appears at hex location 7A76972.)
>>
>> When I open the fonts with FontForge, it suggests the fonts are encoded
>> as Unicode (but the glyphs are obviously not in the right spot).
>> However, when I copy the text (I copied from LO Draw), paste it into
>> jEdit, and save that as Unicode, reopening the file gives a warning: 'not
>> unicode, text may be missing'.
>>
>> So, what this means is that there are some glyphs encoded into locations
>> that Unicode treats as control or non-printing codes. The text needs to be
>> dealt with as a specific encoding that matches whatever the original font
>> actually uses. I haven't figured out what the original text files were
>> encoded with. Without that knowledge, I'm not sure my system clipboard or
>> editor (jEdit) will properly respect the glyphs in unusual locations until
>> the conversion to Unicode, and I don't trust myself to be able to detect
>> whether it is properly converted.
>>
>> On Mon, May 13, 2019 at 10:11 AM Cyrille  wrote:
>>
>>> David,
>>> You are probably right about TECkit: if we get the text, it will help
>>> us convert it to Unicode.
>>> About how to get the text, your method 

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Cyrille


On 14/05/2019 22:26, David Haslam wrote:
> If Michael’s observations are anything to go by, then maybe I can
> script the recovery of chapter & verse tags. 
>
> We shall see 
>
> Even if I’m not immediately successful - valuable lessons can be
> learned in the attempt.
Very well, I'll wait for you ;)
>
> David
>
> Sent from ProtonMail Mobile
>

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread David Haslam
If Michael’s observations are anything to go by, then maybe I can script the 
recovery of chapter & verse tags.

We shall see 

Even if I’m not immediately successful - valuable lessons can be learned in the 
attempt.

David

Sent from ProtonMail Mobile

On Tue, May 14, 2019 at 21:21, Cyrille  wrote:

> Ok thank you!  I already have all the text in Unicode but without the verse
> numbers and chapters... I began manually...
>
> On 14/05/2019 22:17, David Haslam wrote:
>
>> Hi Cyrille
>>
>> If I can find the time tomorrow or later, I’ll have a look at what might be 
>> feasible.
>>
>> Thanks for all these useful links.
>>
>> David
>>
>> Sent from ProtonMail Mobile
>>
>> On Tue, May 14, 2019 at 14:08, Cyrille  wrote:
>>
>>> I'm sending my message again because the first one was too big.
>>>
>>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>> or:
>>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>>
>>> See the result 
>>> [here](https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=).
>>>
>>> Now the only problem is how to get the verse and chapter number...

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Cyrille
Ok thank you!  I already have all the text in Unicode but without the
verse numbers and chapters... I began manually...

On 14/05/2019 22:17, David Haslam wrote:
> Hi Cyrille 
>
> If I can find the time tomorrow or later, I’ll have a look at what
> might be feasible. 
>
> Thanks for all these useful links. 
>
> David
>
> Sent from ProtonMail Mobile
>
>
> On Tue, May 14, 2019 at 14:08, Cyrille  > wrote:
>> I'm sending my message again because the first one was too big.
>>
>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>> or:
>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>
>> See the result here:
>> https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=
>>
>> Now the only problem is how to get the verse and chapter number...
>>
>>
>> On 14/05/2019 13:53, Michael H wrote:
>>> Cyrille, (Peter),
>>>
>>> Maybe further discussion on this belongs in GitLab as issues.  Can I
>>> get added to this project?
>>>
>>> Here are the first few lines of Matthew copied from the PDF: 
>>> --
>>> OD; {0Ha*vdusrf;
>>> The Gospel According to Matthew
>>> ed'gef;
>>> usr;f ûyy*k Kd¾v f  rf maw;O;D \b0rwS wf r;f
>>> usr;f ûyy*k Kd¾v f  rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d
>>> tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf wG
>>> U Ny;D
>>>
>>> -
>>> And here are the first few lines of Matthew copied from the
>>> Pagemaker file: 
>>> -
>>> Sifrmaw;OD; {0Ha*vdusrf;
>>> The Gospel According to Matthew
>>> ed'gef;
>>> usrf;�yyk*�dKvf  OD;\b0rSwfwrf;  
>>> usrf;�yyk*�dKvf  OD;onf  *gavav;,e,frS *sL;vlrsKd;
>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf
>>> trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD 
>>> ol\trnfrSm av0djzpf\/ olonf  wdab;,tkdifteD;wGif 
>>> a,Zl;ocifESifhawGU  NyD;
>>>
>>>
>>> You can see that some letters have changed, and some others are in a
>>> different order. 
>>>
>>> The letters that change are likely those points that aren't
>>> compatible with unicode, and pagemaker reassigned them to ensure
>>> that the file is more widely viewable. Since a conversion is already
>>> planned, these won't matter as much, but the font embedded in the
>>> PDF is different than the font attached to the pagemaker file,  If
>>> you do start from the PDF, you'll need to extract the font to get
>>> the code points. 
>>>
>>> The problem is that the PDF export from pagemaker sorts the letters
>>> into the order they appear on the page.  Burmese text has Indian
>>> style ligatures, where vowels tend to jump over or under the
>>> previous letters, sometimes back 2 or three letters. If you study
>>> the following snippets from the beginning of Matthew, you can see
>>> there is a difference in order, as well as some glyphs are modified. 
>>>
>>> So, from the PDF letters are out of order, but from Pagemaker,
>>> letters are encoded into control points. Fixing the control points
>>> is easy and happens with the unicode conversion.  Fixing the letter
>>> order is not easy. You'll need a first language speaker and plenty
>>> of time. 
>>>
>>> The guidance I received on another group was to use either LO Draw
>>> or Indesign to export the text from Pagemaker.  I'll look into LO
>>> Draw again, but I don't have access to an older version of Indesign
>>> (the pagemaker import was removed in CS6). 
>>>
>>>
>>> On Mon, May 13, 2019 at 10:40 AM Michael H >> > wrote:
>>>
>>> I unzipped the pagemaker file, and when I open
>>> NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can 'find'
>>> all of the book names, and see the text there.  
>>>
>>> To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip and
>>> open it with a zip archive progeram.  The text is in the
>>> Pagemaker file at the top level of the archive, but encoded with
>>> a lot of extraneous information.  (The English text "Matthew"
>>> appears at hex location 7A76972). 
>>>
>>> When I open the fonts with fontforge, Fontforge suggests the
>>> fonts are encoded as unicode (but the glyphs are obviously not
>>> in the right spot.) 
>>> However when I copy the text (I copied from LO Draw) and paste
>>> it into jedit and save that as unicode: Reopening the file has a
>>> warning 'not unicode, text may be missing'. 
>>>
>>> So, what this means is that there are some glyphs encoded into
>>> locations that unicode treats as control or non-printing codes.
>>> The text needs to be dealt with as a specific encoding that
>>> matches whatever the original font actually uses. I haven't
>>> figured out what the original text files were encoded with.
>>> Without that knowledge, I'm not sure my 

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread David Haslam
Hi Cyrille

If I can find the time tomorrow or later, I’ll have a look at what might be 
feasible.

Thanks for all these useful links.

David

Sent from ProtonMail Mobile

On Tue, May 14, 2019 at 14:08, Cyrille  wrote:

> I am sending my message again because the first one was too big.
>
> The conversion to UTF-8 is 99% solved!! I used an online converter:
> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
> or:
> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>
> See the result 
> [here](https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=).
>
> Now the only problem is how to get the verse and chapter numbers...
>

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread David Haslam
The ThanLwinSoft software was indeed developed by Keith Stribley (1976-2011).

Screenshot posted to my Facebook timeline.

https://m.facebook.com/story.php?story_fbid=10213794210749822=1243443528

We had exchanged emails during the year before he died.

Best regards

David

Sent from ProtonMail Mobile

On Tue, May 14, 2019 at 14:08, Cyrille  wrote:

> I am sending my message again because the first one was too big.
>
> The conversion to UTF-8 is 99% solved!! I used an online converter:
> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
> or:
> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>
> See the result 
> [here](https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=).
>
> Now the only problem is how to get the verse and chapter numbers...
>

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Cyrille
I am sending my message again because the first one was too big.

The conversion to UTF-8 is 99% solved!! I used an online converter:
https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
or:
http://burglish.my-mm.org/latest/trunk/web/fontconv.htm

See the result here:
https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=

Now the only problem is how to get the verse and chapter numbers...



Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Michael H
Cyrille, (Peter),

Maybe further discussion of this belongs in GitLab as issues.  Can I be
added to this project?

Here are the first few lines of Matthew copied from the PDF:
--
OD; {0Ha*vdusrf;
The Gospel According to Matthew
ed'gef;
usr;f ûyy*k Kd¾v f  rf maw;O;D \b0rwS wf r;f
usr;f ûyy*k Kd¾v f  rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d tmvaf
z;O;D \om;jzp\f / (rmu k2;14)
olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf wG U Ny;D

-
And here are the first few lines of Matthew copied from the Pagemaker file:
-
Sifrmaw;OD; {0Ha*vdusrf;
The Gospel According to Matthew
ed'gef;
usrf;�yyk*�dKvf  OD;\b0rSwfwrf;
usrf;�yyk*�dKvf  OD;onf  *gavav;,e,frS *sL;vlrsKd;
tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf trIxrf;chJonf/ (vk
5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD  ol\trnfrSm av0djzpf\/ olonf
wdab;,tkdifteD;wGif  a,Zl;ocifESifhawGU  NyD;


You can see that some letters have changed, and some others are in a
different order.

The letters that change are likely those at code points that aren't
compatible with Unicode; PageMaker reassigned them to make the file more
widely viewable. Since a conversion is already planned, these won't matter
as much. However, the font embedded in the PDF is different from the font
attached to the PageMaker file, so if you do start from the PDF, you'll
need to extract that font to get the code points.

The bigger problem is that the PDF export from PageMaker sorts the letters
into the order they appear on the page.  Burmese text has Indic-style
ligatures, where vowels tend to jump over or under the previous letters,
sometimes back two or three letters. If you study the snippets above from
the beginning of Matthew, you can see that the order differs and that some
glyphs are modified.

So, from the PDF the letters are out of order, while from PageMaker the
letters are encoded at control-code positions. Fixing the control-code
positions is easy and happens as part of the Unicode conversion.  Fixing
the letter order is not easy: you'll need a first-language speaker and
plenty of time.
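
To make "out of order" concrete in Unicode terms, here is a small Python illustration. It is only a sketch and not one of the converters mentioned in this thread. It assumes text that is already in Unicode code points but in page (visual) order, and it handles just one case: MYANMAR VOWEL SIGN E (U+1031), which is drawn to the left of its consonant but stored after it. The function name is made up for the example; real text has more cases (for instance a medial RA that is also drawn on the left), which is why a reader of Burmese is still needed to check the result.

```python
import re

# Visual order: vowel sign E (U+1031) appears before the consonant it belongs to.
# Logical (Unicode) order: consonant, optional medial signs, then vowel sign E.
PREPOSED_E = re.compile("\u1031([\u1000-\u1021][\u103B-\u103E]*)")


def visual_to_logical(text: str) -> str:
    """Move a pre-posed vowel sign E behind its consonant and medials.

    Hypothetical helper for illustration only; it covers a single
    reordering rule and ignores every other visual-order quirk.
    """
    return PREPOSED_E.sub("\\1\u1031", text)


if __name__ == "__main__":
    visual = "\u1031\u1000"  # E sign followed by letter KA, as seen on the page
    logical = visual_to_logical(visual)
    print([f"U+{ord(ch):04X}" for ch in logical])
```

A single regular expression is enough here only because the rule is local to one syllable; it says nothing about letters that the PDF export moved further away.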

The guidance I received on another group was to use either LibreOffice
Draw (LO Draw) or InDesign to export the text from PageMaker.  I'll look
into LO Draw again, but I don't have access to an older version of InDesign
(the PageMaker import was removed in CS6).


On Mon, May 13, 2019 at 10:40 AM Michael H  wrote:

> I unzipped the PageMaker file, and when I open NT_Proverb/Pagemaker
> (10.1 MB) with a hex editor, I can 'find' all of the book names and see
> the text there.
>
> To see the raw text: rename NT_Proverb.pmd to NT_Proverb.zip and open it
> with a zip archive program.  The text is in the Pagemaker file at the top
> level of the archive, but encoded with a lot of extraneous information.
> (The English text "Matthew" appears at hex location 7A76972.)
>
> When I open the fonts with FontForge, it suggests the fonts are encoded
> as Unicode (but the glyphs are obviously not in the right spot).
> However, when I copy the text (I copied from LO Draw), paste it into
> jEdit and save that as Unicode, reopening the file gives a warning: 'not
> unicode, text may be missing'.
>
> So, what this means is that there are some glyphs encoded into locations
> that Unicode treats as control or non-printing codes. The text needs to be
> dealt with as a specific encoding that matches whatever the original font
> actually uses. I haven't figured out what the original text files were
> encoded with. Without that knowledge, I'm not sure my system clipboard or
> editor (jEdit) will properly respect the glyphs in unusual locations until
> the conversion to Unicode, and I don't trust myself to be able to detect
> whether it is properly converted.
>
> On Mon, May 13, 2019 at 10:11 AM Cyrille  wrote:
>
>> David,
>> You are probably right about TECkit: if we get the text, it will help
>> us convert it to Unicode.
>> As for how to get the text, your method is beyond my skills :)
>> If you succeed, please let me know.
>>
>> On 13/05/2019 16:21, David Haslam wrote:
>>
>> Given the insights from Michael Hart, it may be feasible to temporarily
>> rearrange the main text stream as follows :
>>
>> 1. Replace every EOL by a horizontal tab.
>> 2. Insert an EOL after each verse end character.
>>
>> Observe that the above two steps are wholly reversible such that the
>> original text stream can be restored later.
>>
>> In effect the text stream is now in verse per line (VPL) layout, albeit
>> without verse tags. Some adjustments may be necessary if there are any
>> section headings, etc.
>>
>> 3. Add line numbers with the first number being reset to 1 at the start
>> of each chapter, numbers incrementing by 1 for each line.
>> 4. Add a left margin USFM verse tag \v_
>>
>> Steps 3&4 can be implemented in various ways. For my part, I’d use a
>> bespoke TextPipe filter.
>>
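
Steps 1 to 4 quoted above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the TextPipe filter David mentions: it assumes the text has already been split into one string per chapter, that the stream contains no tab characters of its own, and that a single verse-end marker is passed in by the caller. The function names are invented for the example.

```python
def to_verse_per_line(chapter_text: str, verse_end: str) -> list[str]:
    """Steps 1 and 2: put one verse on each line, reversibly."""
    # Step 1: replace every EOL with a horizontal tab.
    flat = chapter_text.replace("\n", "\t")
    # Step 2: insert an EOL after each verse-end marker.
    lines = flat.replace(verse_end, verse_end + "\n").split("\n")
    # A marker at the very end leaves one empty trailing line; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def add_verse_tags(lines: list[str]) -> list[str]:
    """Steps 3 and 4: number the lines from 1 and prefix a USFM verse tag."""
    return [f"\\v {number} {line}" for number, line in enumerate(lines, start=1)]


def restore_layout(lines: list[str]) -> str:
    """Revert steps 2 and 1 (valid only if the source had no tabs)."""
    return "".join(lines).replace("\t", "\n")


if __name__ == "__main__":
    sample = "aaa\nbbb/ ccc\nddd/"  # made-up text, "/" as a stand-in marker
    vpl = to_verse_per_line(sample, "/")
    for tagged in add_verse_tags(vpl):
        print(tagged)
    print(restore_layout(vpl) == sample)
```

Because the numbering restarts at 1 on every call, calling it once per chapter gives the per-chapter reset described in step 3. Section headings and book introductions would still need the manual adjustments mentioned above.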

[sword-devel] Commentary module -- no text output (RESOLVED)

2019-05-14 Thread Christopher Adams
I'm not sure how, but I seem to have missed the e-mail of the response to
my last post. So I apologize that this post isn't connected with it, and is
a bit late.

Troy and David, thank you for your responses; they were very helpful!

Also, just for completeness, I'll add that I wasn't sure what file to point
to in the '.conf' file, but it turned out to be just the folder, not any
one file. So the whole line is:

DataPath=./modules/comments/rawcom/test

Thanks again, everyone!
-Chris Adams.
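
For anyone hitting the same problem, a quick way to check that DataPath names a folder rather than a file is sketched below in Python. It is an illustration only: the function name, the idea of passing in a SWORD root directory, and the use of the first section of the .conf file are assumptions made for the example, not part of the SWORD tools.

```python
import configparser
from pathlib import Path


def data_path_is_folder(conf_text: str, sword_root: str) -> bool:
    """Return True if the DataPath entry resolves to an existing folder.

    Hypothetical helper: reads the first [section] of a module .conf
    and resolves its DataPath relative to the given root directory.
    """
    # strict=False tolerates repeated keys; interpolation=None leaves '%' alone.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    section = parser.sections()[0]
    data_path = parser[section]["DataPath"]
    return (Path(sword_root) / data_path).is_dir()


if __name__ == "__main__":
    conf = "[test]\nDataPath=./modules/comments/rawcom/test\n"
    print(data_path_is_folder(conf, "."))
```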


Virus-free.
www.avg.com

<#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>

Re: [sword-devel] Bible in Myanmar

2019-05-14 Thread Cyrille
Yesterday I had a thought: if a PDF tool makes it possible to cut the PDF
pages down the middle, then a raw conversion to text should be possible,
and then we would only need to convert it to UTF-8.
Any ideas?

On 13/05/2019 17:40, Michael H wrote:
> I unzipped the PageMaker file, and when I open NT_Proverb/Pagemaker
> (10.1 MB) with a hex editor, I can 'find' all of the book names and
> see the text there.
>
> To see the raw text: rename NT_Proverb.pmd to NT_Proverb.zip and open
> it with a zip archive program.  The text is in the Pagemaker file at
> the top level of the archive, but encoded with a lot of extraneous
> information.  (The English text "Matthew" appears at hex location
> 7A76972.)
>
> When I open the fonts with FontForge, it suggests the fonts are
> encoded as Unicode (but the glyphs are obviously not in the right spot).
> However, when I copy the text (I copied from LO Draw), paste it into
> jEdit and save that as Unicode, reopening the file gives a warning:
> 'not unicode, text may be missing'.
>
> So, what this means is that there are some glyphs encoded into
> locations that Unicode treats as control or non-printing codes. The
> text needs to be dealt with as a specific encoding that matches
> whatever the original font actually uses. I haven't figured out what
> the original text files were encoded with. Without that knowledge, I'm
> not sure my system clipboard or editor (jEdit) will properly respect
> the glyphs in unusual locations until the conversion to Unicode, and I
> don't trust myself to be able to detect whether it is properly
> converted.
>
> On Mon, May 13, 2019 at 10:11 AM Cyrille wrote:
>
> David,
> You are probably right about TECkit: if we get the text, it will help
> us convert it to Unicode.
> As for how to get the text, your method is beyond my skills :)
> If you succeed, please let me know.
>
> On 13/05/2019 16:21, David Haslam wrote:
>> Given the insights from Michael Hart, it may be feasible to
>> temporarily rearrange the main text stream as follows :
>>
>> 1. Replace every EOL by a horizontal tab. 
>> 2. Insert an EOL after each verse end character. 
>>
>> Observe that the above two steps are wholly reversible such that
>> the original text stream can be restored later. 
>>
>> In effect the text stream is now in verse per line (VPL) layout,
>> albeit without verse tags. Some adjustments may be necessary if
>> there are any section headings, etc.
>>
>> 3. Add line numbers with the first number being reset to 1 at the
>> start of each chapter, numbers incrementing by 1 for each line. 
>> 4. Add a left margin USFM verse tag \v_
>>
>> Steps 3&4 can be implemented in various ways. For my part, I’d
>> use a bespoke TextPipe filter. 
>>
>> Another method to consider might be to use Excel formulae. I
>> recall resorting to such a method in the early days of Go Bible. 
>>
>> Now restore the original layout by reverting steps 2 & 1, if this
>> is really necessary. That is, if the original text layout
>> appeared to be paragraphed. 
>>
>> 5. Decide how & where to insert paragraph tags. 
>>
>> 6. Add chapter tags, book ID and main title tags, etc. 
>>
>> Hope this gives some useful suggestions that point towards a
>> practical solution. 
>>
>> Best regards 
>>
>> David
>>
>>
>> Sent from ProtonMail Mobile
>>
>>
>> On Mon, May 13, 2019 at 14:57, Michael H wrote:
>>> Cyrille
>>>
>>> LibreOffice Draw attempts to open the PageMaker file, with
>>> limited success. But it confirms that even in the PageMaker
>>> source, the verse numbers are a separate text stream. With this
>>> source, there is no way to copy the text with verse numbers
>>> intact. Each book appears to be stored in its own text stream
>>> in the PageMaker file. LO Draw isn't rendering all of the pages,
>>> only the first 10, so I've only explored Matthew further.
>>>
>>> Based on Matthew only, the verses all seem to end with the
>>> character "-" or ";/", which should aid in the reconstruction.
>>> I've looked through the PDF and, visually, this seems to be the
>>> case for all books as well. However, this isn't perfect: I find
>>> 1107 of these characters in Matthew, instead of the expected
>>> 1071 verses.  Since the text stream also contains a book
>>> introduction, this is likely easily explained. Hopefully this
>>> gets you well down the path to creating a stream with verses.
>>>
>>> I would NOT start from the PDF file, but from the PageMaker
>>> file.  The PDF almost certainly has a lot of text rearranging
>>> and extra characters like page numbers and running heads.
>>> Pagemaker has the book text in a single