Hi Ben,
Thanks for the further details and tips - my problem is now solved! 

The BBEdit tip re the file 'open as UTF-8' option was a great help. I’d not 
noticed these options before (as I tend to open files from PathFinder folder 
lists rather than via apps). However, this did indeed reveal format errors in 
these cache files when they were saved with the raw (confirmed UTF-8) htmlText 
of the browser widget. Encoding the text to UTF-8 before saving fixed this 
issue, and re-crawling the source pages has resulted in files that BBEdit 
recognises as ‘regular’ UTF-8.
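
Incidentally, the mystery ‚Äö string from my original post turns out to be the 
classic signature of UTF-8 bytes being re-read in a legacy encoding. A quick 
illustration (not my actual code, just a demonstration in the message box):

```livecode
-- "‚" is U+201A; its UTF-8 encoding is the three bytes E2 80 9A.
-- Re-reading those bytes as MacRoman gives one character per byte:
put textDecode(textEncode("‚", "UTF-8"), "MacRoman")
-- shows: ‚Äö
```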

This reduced the anomaly count but, whilst testing, I also noticed that the 
read-write cycle updating the output CSV file was spawning new anomalies and 
expanding those already present. So I wrapped this function to also force 
UTF-8 decoding/encoding - and now all is good.
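
For the record, the wrapper boils down to something like this (a simplified 
sketch - the variable names are stand-ins, not my actual code):

```livecode
-- read the existing CSV as raw bytes and decode explicitly from UTF-8
put URL ("binfile:" & tCsvPath) into tBytes
put textDecode(tBytes, "UTF-8") into tCsv
-- ... append the new rows to tCsv here ...
-- encode explicitly back to UTF-8 and write as raw bytes
put textEncode(tCsv, "UTF-8") into URL ("binfile:" & tCsvPath)
```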

No longer will I assume that a simple text file is a simple text file! :-)

Thanks & regards,
Keith 

> On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode 
> <[email protected]> wrote:
> 
> Hi Keith,
> 
> This might need input from the mothership, but I think if you've obtained the 
> text from the browser widget's htmlText, it will probably be in the special 
> 'internal' format. I'm not entirely sure what happens when you save that as 
> text - I suspect it depends on the platform.
> 
> So for clarity (if you have the opportunity to re-save this material; and if 
> it won't confuse things because existing files are in one format, and new 
> ones another) it would probably be best to textEncode it into UTF-8, then 
> save it as binfile. That way the files on disk should be UTF-8, which is 
> something like a standard.
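> 
> Something along these lines (an untested sketch - adjust names and paths):
> 
> ```livecode
> -- convert from LC's internal text format to UTF-8 bytes...
> put textEncode(tHtmlText, "UTF-8") into tBytes
> -- ...then write the bytes untouched via binfile:
> put tBytes into URL ("binfile:" & tCachePath)
> ```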
> 
> What I tend to do in this situation where I have text files and I'm not sure 
> what the format is (and I spend quite a lot of time messing with text files 
> from various sources, some unknown and many not under my control) is use a 
> good text editor - I use BBEdit on Mac, not sure what suitable alternatives 
> would be on Windows or Linux - to investigate the file. BBEdit makes a guess 
> when it opens the file, but allows you to try re-opening in different 
> encodings, and then warns you if there are byte sequences that don't make 
> sense with that encoding. So by doing this I can often figure out what the 
> encoding of the file is - once you've got that, you're off to the races.
> 
> But if you have the opportunity to re-collect the whole set, then I *think* 
> the above formula of textEncoding from LC's internal format to UTF-8, then 
> saving as binary file; and reversing the process when you load them back in 
> to process; and then doing the same again - possibly to a different format - 
> when you output the CSV, should see you clear.
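> 
> In other words, roughly (sketch only - variable names invented):
> 
> ```livecode
> -- save: internal text -> UTF-8 bytes on disk
> put textEncode(tPageText, "UTF-8") into URL ("binfile:" & tCachePath)
> -- load: UTF-8 bytes on disk -> internal text
> put textDecode(URL ("binfile:" & tCachePath), "UTF-8") into tPageText
> -- output: encode again for the CSV (UTF-8, or another format if needed)
> put textEncode(tCsvText, "UTF-8") into URL ("binfile:" & tCsvPath)
> ```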
> 
> HTH,
> 
> Ben
> 
> 
> On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
>> Thanks Ben, that’s really interesting. It never occurred to me that these 
>> HTML files might be anything other than the simple plain text files I’d 
>> worked with in Coda, etc., for years.
>> The local HTML files store the HTML text pulled from the LiveCode browser 
>> widget, saved using the URL ‘file:’ option. I’d been working ‘live’ from the 
>> browser widget’s htmlText until recently, when I introduced these local 
>> files to split the page ‘crawling’ and analysis activities without needing a 
>> database.
>> Reading the files back into LiveCode with the URL ‘file:’ option works quite 
>> happily with no text anomalies when put into a field to read. The problem 
>> seems to arise when I load the HTML text into a variable and then start to 
>> extract elements using LiveCode's text chunking. For example, pulling the 
>> text between the offsets of, say, <p> and </p> tags is when these character 
>> anomalies start to pop into the strings.
>> A quick test on reading in the local HTML files with the URL ‘binfile:’ 
>> option and then textDecode(tString, “UTF-8”) seems to reduce the frequency 
>> and size of anomalies, but some remain. So, I’ll see if re-crawling pages 
>> and saving the HTML text from the browser widget as binfiles reduces this 
>> further.
>> Thanks & regards,
>> Keith
>>> On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
>>> <[email protected]> wrote:
>>> 
>>> Hi Keith,
>>> 
>>> The thing with character encoding is that you always need to know where 
>>> it's coming from and where it's going.
>>> 
>>> Do you know how the HTML documents were obtained? Saved from a browser, 
>>> fetched by curl, fetched by Livecode? Or generated on disk by something 
>>> else?
>>> 
>>> If it was saved from a browser or fetched by curl, then the format is most 
>>> likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to 
>>> do two things:
>>>     - read it in as a binary file, rather than text (e.g. use URL 
>>> "binfile://..." or "open file ... for binary read")
>>>     - convert it to the internal text format FROM UTF-8 - which means use 
>>> textDecode(tString, "UTF-8"), rather than textEncode
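>>> 
>>> For example (sketch only):
>>> 
>>> ```livecode
>>> -- read the raw bytes, then decode FROM UTF-8 into the internal format
>>> put URL ("binfile:" & tFilePath) into tRawBytes
>>> put textDecode(tRawBytes, "UTF-8") into tText
>>> ```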
>>> 
>>> If it was fetched by LiveCode, then it most likely arrived over the wire as 
>>> UTF-8, but if it was saved by LiveCode as text (not binary) then it _may_ 
>>> have got corrupted.
>>> 
>>> If you can see the text looking as you expect in LiveCode, you've solved 
>>> half the problem. Then you need to consider where it's going: what is going 
>>> to consume the CSV. This is the time to use textEncode, and then be 
>>> sure to save it as a binary file. If the consumer will be something 
>>> reasonably modern, then again UTF-8 is a good default. If it's something 
>>> much older, you might need to use "CP1252" or similar.
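>>> 
>>> e.g. (sketch; substitute "CP1252" for the much-older case):
>>> 
>>> ```livecode
>>> -- encode for the consumer, then save as binary so nothing
>>> -- reinterprets the bytes or mangles line endings on the way out
>>> put textEncode(tCsvText, "UTF-8") into tBytes
>>> put tBytes into URL ("binfile:" & tCsvPath)
>>> ```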
>>> 
>>> HTH,
>>> 
>>> Ben
>>> 
>>> 
>>> On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
>>>> Hi folks,
>>>> I’m using LiveCode to summarise text from HTML documents into CSV summary 
>>>> files, and am noticing that when I extract strings from HTML documents 
>>>> stored on disk - rather than visiting the sites via the browser widget & 
>>>> grabbing the HTML text - weird characters are inserted in place of what 
>>>> appear to be ‘regular’ characters.
>>>> The number of characters inserted can run into the thousands per instance, 
>>>> making my CSV ‘summary’ file run into gigabytes! Has anyone seen the 
>>>> following type of string before, and does anyone happen to know what might 
>>>> be causing it and how to fix it?
>>>> ‚Äö
>>>> I’ve tried deliberately setting UTF-8 on the extracted strings, with put 
>>>> textEncode(tString, "UTF-8") into tString. Currently I’m not attempting to 
>>>> force any text format on the local HTML documents.
>>>> Thanks & regards,
>>>> Keith
>>>> _______________________________________________
>>>> use-livecode mailing list
>>>> [email protected]
>>>> Please visit this url to subscribe, unsubscribe and manage your 
>>>> subscription preferences:
>>>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>> 
> 


