Thanks for posting these.
The latter one (https://quality.livecode.com/show_bug.cgi?id=12205) I was
already following because I think I raised the issue originally and Mark
kindly added a bug entry. I was unaware of the former, but it would also be
a convenient enhancement - especially along with a built-in
'guessEncoding' function.
On 5/31/2021 8:39 AM, Ben Rubinstein via use-livecode wrote:
Also relevant enhancement requests:
https://quality.livecode.com/show_bug.cgi?id=13581
https://quality.livecode.com/show_bug.cgi?id=12205
On 21/05/2021 15:57, Paul Dupuis via use-livecode wrote:
BBEdit has a built-in "guess encoding" function to try to determine
the encoding of a text file.
I have had this bug in to LC now for 6 years:
https://quality.livecode.com/show_bug.cgi?id=14474
Even Fraser, who did much of the Unicode work for LC7, agreed there
should be a guessEncoding function in LiveCode. Instead, anyone who
needs one either has to write their own or find someone who has
written one to get one from.
While you can never tell the encoding of every text file with 100%
accuracy, there are algorithms that make pretty good guesses. I'd still
like to see it as a built-in function in the LC engine.
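To illustrate the kind of "pretty good guess" meant here, the following is a minimal sketch in Python (not LiveCode, and not the algorithm BBEdit or anyone on this list actually uses). The function name guess_encoding and the choice of CP1252 as the fallback are assumptions made for the example. It checks for a byte-order mark, then for pure 7-bit data, then tries a strict UTF-8 decode, and only then falls back to a legacy single-byte encoding.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for raw file bytes (heuristic only)."""
    # 1. A byte-order mark, if present, is the strongest hint available.
    #    (UTF-32 BOMs are deliberately ignored in this sketch.)
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    # 2. Pure 7-bit data reads the same in ASCII, UTF-8 and CP1252.
    if all(byte < 0x80 for byte in data):
        return "ascii"

    # 3. Random 8-bit text is very unlikely to be valid UTF-8 by accident,
    #    so a successful strict decode is good evidence.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 4. Assumed fallback: a legacy single-byte encoding. This step can be
    #    wrong, since MacRoman, Latin-1 and CP1252 are hard to tell apart.
    return "cp1252"
```

A call such as guess_encoding(open(path, "rb").read()) would give a starting point, which still wants confirming by eye in a text editor when the answer is the step 4 fallback.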
On 5/21/2021 8:19 AM, Keith Clarke via use-livecode wrote:
Hi Ben,
Thanks for the further details and tips - my problem is now solved!
The BBEdit tip re file 'open-as UTF-8' was a great help. I’d not
noticed these options before (as I tend to open files from
PathFinder folder lists not via apps). However, this did indeed
reveal format errors on these cache files when they were saved with
the raw (UTF-8 confirmed) htmltext of widget “browser”. Text
encoding to UTF-8 before saving fixed this issue and re-crawling the
source pages has resulted in files that BBEdit recognises as
‘regular’ UTF-8.
This reduced the anomaly count but whilst testing, I also noticed
that the read-write cycle updating the output csv file was spawning
anomalies and expanding those already present. So I wrapped this
function to also force UTF-8 decoding/encoding - and all is now
good.
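A possible explanation for why each read-write cycle expanded the anomalies, sketched in Python rather than LiveCode. It assumes the files were written as UTF-8 but read back as MacRoman, which is only a guess, although it is consistent with the '‚Äö' string reported at the start of this thread. The helper name bad_cycle is made up for the example.

```python
def bad_cycle(text: str) -> str:
    """Simulate one faulty save/load: written as UTF-8, read back as MacRoman."""
    return text.encode("utf-8").decode("mac_roman")


original = "\u2019"          # one right curly apostrophe
once = bad_cycle(original)   # 3 characters of mojibake
twice = bad_cycle(once)      # 7 characters, beginning with the sequence from the thread
thrice = bad_cycle(twice)    # longer again

for label, value in [("original", original), ("once", once),
                     ("twice", twice), ("thrice", thrice)]:
    print(f"{label:8} {len(value):3} chars  {value}")

# While the damage is still a clean chain of misreads it can be undone
# one step at a time by reversing the two operations.
repaired = twice.encode("mac_roman").decode("utf-8")
```

Every non-ASCII character becomes two or three characters on each pass, and those are themselves non-ASCII, so the growth compounds with every cycle. Under the same assumption, that would account for a single character turning into thousands after enough passes.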
No longer will I assume that a simple text file is a simple text
file! :-)
Thanks & regards,
Keith
On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode
<use-livecode@lists.runrev.com> wrote:
Hi Keith,
This might need input from the mothership, but I think if you've
obtained the text from the browser widget's htmlText, it will
probably be in the special 'internal' format. I'm not entirely sure
what happens when you save that as text - I suspect it depends on
the platform.
So for clarity (if you have the opportunity to re-save this
material; and if it won't confuse things because existing files are
in one format, and new ones another) it would probably be best to
textEncode it into UTF-8, then save it as binfile. That way the
files on disk should be UTF-8, which is something like a standard.
What I tend to do in this situation where I have text files and I'm
not sure what the format is (and I spend quite a lot of time
messing with text files from various sources, some unknown and many
not under my control) is use a good text editor - I use BBEdit on
Mac, not sure what suitable alternatives would be on Windows or
Linux - to investigate the file. BBEdit makes a guess when it opens
the file, but allows you to try re-opening in different encodings,
and then warns you if there are byte sequences that don't make
sense with that encoding. So by doing this I can often figure out
what the encoding of the file is - once you've got that, you're off
to the races.
But if you have the opportunity to re-collect the whole set, then I
*think* the above formula of textEncoding from LC's internal format
to UTF-8, then saving as binary file; and reversing the process
when you load them back in to process; and then doing the same
again - possibly to a different format - when you output the CSV,
should see you clear.
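The formula above (encode to UTF-8, save as binary, reverse on load) can be sketched as follows. This is Python, not LiveCode: str.encode and bytes.decode stand in for textEncode and textDecode, and opening the file in "wb"/"rb" mode stands in for binfile. The names save_text and load_text and the sample page text are invented for the example.

```python
import os
import tempfile


def save_text(path: str, text: str) -> None:
    """Encode text to UTF-8 and write the bytes untouched (binary mode)."""
    with open(path, "wb") as handle:
        handle.write(text.encode("utf-8"))


def load_text(path: str) -> str:
    """Read the raw bytes back and decode them from UTF-8."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8")


page = "<p>Keith\u2019s caf\u00e9 \u2013 menu</p>"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "page.html")
    save_text(path, page)
    size_on_disk = os.path.getsize(path)
    restored = load_text(path)

print(restored == page)             # the round trip is lossless
print(size_on_disk, len(page))      # bytes on disk vs characters in memory
```

The point of the binary mode on both sides is that nothing between the program and the disk gets a chance to reinterpret the bytes, so the encoding is whatever was chosen explicitly.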
HTH,
Ben
On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
Thanks Ben, that’s really interesting. It never occurred to me
that these html files might be anything other than simple plain
text files, such as I’d worked with in Coda, etc., for years.
The local HTML files are storage of the HTML text pulled from the
LiveCode browser widget, saved using the URL ‘file:’ option. I’d
been working ‘live’ from the Browser widget’s html text until
recently, when I’ve introduced these local files to split page
‘crawling’ and analysis activities without needing a database.
Reading the files back into LiveCode with the URL ‘file:’ option
works quite happily with no text anomalies when put into a field
to read. The problem seems to arise when I load the HTML text into
a variable and then start to extract elements using LiveCode's
text chunking. For example pulling the text between the offsets of
say <p> & </p> tags is when these character anomalies have started
to pop into the strings.
A quick test on reading in the local HTML files with the URL
‘binfile:’ option and then textDecode(tString, "UTF-8") seems to
reduce the frequency and size of anomalies, but some remain. So,
I’ll see if re-crawling pages and saving the HTML text from the
browser widget as binfiles reduces this further.
Thanks & regards,
Keith
On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode
<use-livecode@lists.runrev.com> wrote:
Hi Keith,
The thing with character encoding is that you always need to know
where it's coming from and where it's going.
Do you know how the HTML documents were obtained? Saved from a
browser, fetched by curl, fetched by LiveCode? Or generated on
disk by something else?
If it was saved from a browser or fetched by curl, then the
format is most likely to be UTF-8. In order to see it correctly
in LiveCode, you'd need to do two things:
- read it in as a binary file, rather than text (e.g. use URL
"binfile://..." or "open file ... for binary read")
- convert it to the internal text format FROM UTF-8 - which
means use textDecode(tString, "UTF-8"), rather than textEncode
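The difference in direction between the two steps above can be shown with a small Python sketch (Python's bytes.decode and str.encode playing the roles of textDecode and textEncode). The sample word and the use of Latin-1 as the wrongly assumed encoding are illustrative choices, not something taken from Keith's files.

```python
# Stand-in for bytes read from disk in binary mode: UTF-8 for "café".
raw = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9', 5 bytes

# Correct direction: bytes -> text, i.e. DEcode FROM UTF-8.
text = raw.decode("utf-8")                 # 4 characters

# Wrong direction: the bytes are first taken for single-byte text
# (Latin-1 here, as an assumed example) and then ENcoded to UTF-8.
misread = raw.decode("latin-1")            # 5 characters, the accent is now 2
double_encoded = misread.encode("utf-8")   # 7 bytes, the damage is baked in

print(text, len(text))
print(misread, len(misread))
print(double_encoded, len(double_encoded))
```

Encoding text that has already been misread does not repair it; it only re-encodes the wrong characters, which is why forcing UTF-8 on the extracted strings made no improvement.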
If it was fetched by LiveCode, then it most likely arrived over
the wire as UTF-8, but if it was saved by LiveCode as text (not
binary) then it _may_ have got corrupted.
If you can see the text looking as you expect in LiveCode, you've
solved half the problem. Then you need to consider where it's
going: who (or what) is going to consume the CSV. This is the time
to use textEncode, and then be sure to save it as a binary file.
If the consumer will be something reasonably modern, then again
UTF-8 is a good default. If it's something much older, you might
need to use "CP1252" or similar.
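As a rough illustration of the output side, here is a Python sketch of encoding the same CSV text for a modern consumer and for an older one. The function name encode_csv, the sample row, and the decision to replace unrepresentable characters with "?" are all assumptions made for the example.

```python
def encode_csv(text: str, encoding: str = "utf-8") -> bytes:
    """Encode CSV text for a given consumer.

    errors="replace" substitutes "?" for any character the target encoding
    cannot represent, instead of raising an exception.
    """
    return text.encode(encoding, errors="replace")


row = "name,note\r\nZo\u00eb,\u201cna\u00efve\u201d \u2192 ok\r\n"

utf8_bytes = encode_csv(row)               # every character survives
cp1252_bytes = encode_csv(row, "cp1252")   # the arrow is not in CP1252

print(len(row), len(utf8_bytes), len(cp1252_bytes))
print(cp1252_bytes.decode("cp1252"))
```

CP1252 uses exactly one byte per character, so it suits older consumers, but anything outside its repertoire is lost. UTF-8 keeps everything at the cost of multi-byte sequences that the consumer must know to decode. Either way the resulting bytes should be written in binary mode.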
HTH,
Ben
On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
Hi folks,
I’m using LiveCode to summarise text from HTML documents into
csv summary files and am noticing that when I extract strings
from html documents stored on disk - rather than visiting the
sites via the browser widget & grabbing the HTML text - weird
characters are being inserted in place of what appear to be
‘regular’ characters.
The number of characters inserted can run into the thousands per
instance, making my csv ‘summary’ file run into gigabytes! Has
anyone seen the following type of string before, happen to know
what might be causing it and offer a fix?
‚Äö
I’ve tried deliberately setting UTF-8 on the extracted strings,
with put textEncode(tString, "UTF-8") into tString. Currently
I’m not attempting to force any text format on the local HTML
documents.
Thanks & regards,
Keith
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your
subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode