I have *.html files that were extracted from epub documents.

At our instruction, the company preparing these epubs was asked to use 
Unicode throughout and avoid all ANSI chars (i.e. use the Unicode for mdash and 
not — from the Mac keyboard).

If I drop these on any browser they are perfectly rendered.

(see online example here:

http://www.himalayanacademy.com/media/books/living-with-siva/web/part1_06.html)



When I open these as UTF-8 Unicode files in BBEdit, I see the characters like 
this:

The knowl&#173;edge of how to realize this one&#173;ness and not create 
un&#173;wanted ex&#173;&#173;periences along the way. The peerless path is 
following the way of our spiritual forefathers, discovering the mystical 
meaning of the scrip&#173;tures.

To make life interesting, some files were exported with the decimal form and 
others with the hex form (&#x201D;). Some of the files mix the hex-style 
notation alongside the decimal in the same paragraph block. The browser doesn't 
seem to care about that, so we are getting &#173; and then later in the same 
paragraph: "&#xB6;


which is also exactly what you see if you look at source in the browser for the 
page.

I need to be able to target, for search and replace, a subset of these 
characters. In particular, these are troublesome:

1) Pilcrow (Paragraph Sign) ¶ (&#182; or &#xB6;): the old "run-on" paragraph 
mark, which we want to turn into block paragraph form, i.e. just an extra 
blank line.

2) Discretionary (soft) hyphens coming through from InDesign all the way into 
the HTML files, which are ignored in browsers but appear as dashes/hyphens in 
LiveCode fields after textDecode: &#173; or &#xAD;. After I run scripts they 
sometimes appear as a literal soft hyphen character (U+00AD).


3) The old ligature double characters, in particular &#64257; (U+FB01, "fi" 
tied together; hex form &#xFB01;). In LiveCode, this just disappears: instead 
of "we find" I see "we nd" in the field. I suspect the character is actually 
there but the font can't render it, so the LC field is just white space in 
that location. Example:

ful&#64257;llment
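
To make the target concrete, here is a minimal sketch of the replacement I am after. It is not a tested solution. The handler name cleanTroubleChars is made up for illustration, and it assumes pText has already been through textDecode exactly once, so that any literal (non-entity) characters are real Unicode characters. Each character is handled in the three spellings it might have in the file: decimal entity, hex entity, and literal.

```livecode
-- Hypothetical helper (name is an assumption, not an existing handler).
-- Assumes pText was produced by textDecode(<binfile data>, "UTF-8").
function cleanTroubleChars pText
   -- 1) Pilcrow, U+00B6 (decimal 182): turn into an extra blank line
   replace "&#182;" with (cr & cr) in pText
   replace "&#xB6;" with (cr & cr) in pText
   replace numToCodepoint(182) with (cr & cr) in pText
   -- 2) Soft hyphen, U+00AD (decimal 173): remove entirely
   replace "&#173;" with empty in pText
   replace "&#xAD;" with empty in pText
   replace numToCodepoint(173) with empty in pText
   -- 3) "fi" ligature, U+FB01 (decimal 64257): spell out as two letters
   replace "&#64257;" with "fi" in pText
   replace "&#xFB01;" with "fi" in pText
   replace numToCodepoint(64257) with "fi" in pText
   return pText
end cleanTroubleChars
```

Usage would be something like: put cleanTroubleChars(tNewText) into tNewText, before the text is handed to a field.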


Try as I might, I am unable to target these characters via LC script. It's as 
if what I see in BBEdit and in the page source in the browser is not what 
gets imported using "put url" in any form in LC. Using

put url ("file:/" & "part1_ch3.html") into tText ## no search and replace I attempt will work

Using binfile or textDecode before insertion into the variable also doesn't 
work.
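
One way to check what actually arrives in the variable would be a small diagnostic like the sketch below. The handler name describeSoftHyphens and the parameter pPath (a full path to one of the .html files) are assumptions for illustration. It reads the raw bytes, decodes them once, and reports which spelling of the soft hyphen is present.

```livecode
-- Hypothetical diagnostic (name and parameter are assumptions).
-- pPath: full path to one of the extracted .html files.
function describeSoftHyphens pPath
   local tBytes, tText, tReport
   put URL ("binfile:" & pPath) into tBytes
   -- decode the raw bytes exactly once
   put textDecode(tBytes, "UTF-8") into tText
   put "decimal entity &#173;: " & (tText contains "&#173;") & cr into tReport
   put "hex entity &#xAD;: " & (tText contains "&#xAD;") & cr after tReport
   put "literal U+00AD: " & (tText contains numToCodepoint(173)) after tReport
   return tReport
end describeSoftHyphens
```

Calling answer describeSoftHyphens(tSomePath) on a single file would show whether the entity strings seen in BBEdit survive the import, or whether only the literal character is present.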

Here is my script (the latest iteration of many variants, none of which gets 
the job done):



global gBookFilesLocation

on mouseup
   palette this stack
   set the defaultstack to the topstack
   put fld "TOC" into tToc
   repeat for each line x in tToc
      set the itemdel to "."
      put item 1 of x & "-clean.txt" into tFilename
      put the uBookFilesLocation of this stack into tPath
      #put url ("file:" & tPath & "/ops/xhtml/" & x ) into tText
      put url ("binfile:" & tPath & "/ops/xhtml/" & x ) into tText
      put textDecode(tText,"UTF-8") into tNewText
      # check for discretionary hyphens
      if tNewText contains "&#173;" then
         answer "Yes got unicode soft hyphen string" with "OK"
         exit to top # this never fires ...
      end if

      ## Attempts to replace here:
      # never happens; may need to use "</p>"&cr&"<p>"
      replace "&#182;" with (cr & cr) in tNewText
      # Pilcrow sign for the above char: U+00B6, old name "Paragraph Sign" -- we want an extra blank line.
      replace "&#xB6;" with (cr&cr) in tNewText
      replace "&#173;" with "" in tNewText # soft hyphen
      replace "&#xAD;" with "" in tNewText # soft hyphen - replace with nothing
      replace "<br/>" with " " in tNewText
      replace "<br />" with " " in tNewText
      replace "<br>" with " " in tNewText
      # remove double spaces created by breaks following a space.
      replace "  " with " " in tNewText

      put textDecode(tNewText,"UTF-8") into tNewTextOut
      set the htmlText of fld "CurrentChapterText" to tNewTextOut
      put textEncode(fld "CurrentChapterText","UTF-8") into url ("binfile:" & gBookFilesLocation & "/_cleanExport1/" & tFilename)
   end repeat
end mouseup

I tried, in another text processing stack I have, just to go direct from file 
into script/vars and then out to file without going thru a field… that's no 
better.

Any clues on how to approach this?

Brahmanathaswami



_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
