Hi Ben,

Thanks for the further details and tips - my problem is now solved!
The BBEdit tip re the file 'open as UTF-8' option was a great help. I’d not noticed these options before (as I tend to open files from Path Finder folder lists rather than via apps). It did indeed reveal format errors in these cache files, which had been saved with the raw (UTF-8-confirmed) htmlText of the browser widget. Encoding the text to UTF-8 before saving fixed this, and re-crawling the source pages has produced files that BBEdit recognises as ‘regular’ UTF-8.

This reduced the anomaly count, but whilst testing I also noticed that the read-write cycle updating the output CSV file was spawning new anomalies and expanding those already present. So I wrapped that function to force UTF-8 decoding/encoding as well - and now all is good.

No longer will I assume that a simple text file is a simple text file! :-)

Thanks & regards,
Keith

> On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode
> <[email protected]> wrote:
>
> Hi Keith,
>
> This might need input from the mothership, but I think if you've obtained the
> text from the browser widget's htmlText, it will probably be in the special
> 'internal' format. I'm not entirely sure what happens when you save that as
> text - I suspect it depends on the platform.
>
> So for clarity (if you have the opportunity to re-save this material, and if
> it won't confuse things because existing files are in one format and new
> ones in another), it would probably be best to textEncode it into UTF-8, then
> save it as a binfile. That way the files on disk should be UTF-8, which is
> something like a standard.
>
> What I tend to do in this situation, where I have text files and I'm not sure
> what the format is (and I spend quite a lot of time messing with text files
> from various sources, some unknown and many not under my control), is use a
> good text editor - I use BBEdit on Mac; I'm not sure what suitable
> alternatives would be on Windows or Linux - to investigate the file.
> BBEdit makes a guess when it opens the file, but allows you to try re-opening
> it in different encodings, and then warns you if there are byte sequences
> that don't make sense in that encoding. By doing this I can often figure out
> what the encoding of the file is - and once you've got that, you're off to
> the races.
>
> But if you have the opportunity to re-collect the whole set, then I *think*
> the above formula - textEncoding from LC's internal format to UTF-8, then
> saving as a binary file; reversing the process when you load the files back
> in to process them; and then doing the same again, possibly to a different
> format, when you output the CSV - should see you clear.
>
> HTH,
>
> Ben
>
>
> On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
>> Thanks Ben, that’s really interesting. It never occurred to me that these
>> HTML files might be anything other than the simple plain-text files I’d
>> worked with in Coda, etc., for years.
>> The local HTML files store the HTML text pulled from the LiveCode browser
>> widget, saved using the URL ‘file:’ option. I’d been working ‘live’ from the
>> browser widget’s htmlText until recently, when I introduced these local
>> files to split the page ‘crawling’ and analysis activities without needing a
>> database.
>> Reading the files back into LiveCode with the URL ‘file:’ option works quite
>> happily, with no text anomalies when the result is put into a field to read.
>> The problem seems to arise when I load the HTML text into a variable and
>> then start to extract elements using LiveCode's text chunking. For example,
>> pulling the text between the offsets of, say, <p> and </p> tags is when
>> these character anomalies have started to pop into the strings.
>> A quick test of reading in the local HTML files with the URL ‘binfile:’
>> option and then textDecode(tString, “UTF-8”) seems to reduce the frequency
>> and size of the anomalies, but some remain.
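(A sketch of that ‘binfile:’ read test in LiveCode script - the variable names here are illustrative, not from the thread:)

```livecode
-- read the raw bytes, so no text-mode transcoding happens on the way in
put URL ("binfile:" & tPath) into tBytes
-- convert the UTF-8 bytes into LiveCode's internal text format
put textDecode(tBytes, "UTF-8") into tHtml
```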
>> So, I’ll see if re-crawling the pages and saving the HTML text from the
>> browser widget as binfiles reduces this further.
>> Thanks & regards,
>> Keith
>>> On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode
>>> <[email protected]> wrote:
>>>
>>> Hi Keith,
>>>
>>> The thing with character encoding is that you always need to know where
>>> it's coming from and where it's going.
>>>
>>> Do you know how the HTML documents were obtained? Saved from a browser,
>>> fetched by curl, fetched by LiveCode? Or generated on disk by something
>>> else?
>>>
>>> If it was saved from a browser or fetched by curl, then the format is most
>>> likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to
>>> do two things:
>>> - read it in as a binary file, rather than text (e.g. use URL
>>> "binfile://..." or "open file ... for binary read")
>>> - convert it to the internal text format FROM UTF-8 - which means using
>>> textDecode(tString, "UTF-8"), rather than textEncode
>>>
>>> If it was fetched by LiveCode, then it most likely arrived over the wire as
>>> UTF-8; but if it was saved by LiveCode as text (not binary), then it _may_
>>> have got corrupted.
>>>
>>> If you can see the text looking as you expect in LiveCode, you've solved
>>> half the problem. Then you need to consider where it's going: what is
>>> going to consume the CSV. This is the time to use textEncode, and then be
>>> sure to save the result as a binary file. If the consumer will be something
>>> reasonably modern, then again UTF-8 is a good default. If it's something
>>> much older, you might need to use "CP1252" or similar.
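(A minimal sketch of that round trip in LiveCode script - the paths and variable names are illustrative:)

```livecode
-- save: LC internal text -> UTF-8 bytes on disk
put textEncode(tHtml, "UTF-8") into URL ("binfile:" & tCachePath)

-- load: UTF-8 bytes on disk -> LC internal text, ready for chunking
put textDecode(URL ("binfile:" & tCachePath), "UTF-8") into tHtml

-- output: encode again when writing the CSV ("CP1252" for older consumers)
put textEncode(tCsv, "UTF-8") into URL ("binfile:" & tCsvPath)
```

The key point is the symmetry: textEncode on every write, textDecode on every read, always via binfile so nothing transcodes the bytes behind your back.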
>>>
>>> HTH,
>>>
>>> Ben
>>>
>>>
>>> On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
>>>> Hi folks,
>>>> I’m using LiveCode to summarise text from HTML documents into CSV summary
>>>> files, and am noticing that when I extract strings from HTML documents
>>>> stored on disk - rather than visiting the sites via the browser widget and
>>>> grabbing the HTML text - weird characters are being inserted in place of
>>>> what appear to be ‘regular’ characters.
>>>> The number of characters inserted can run into the thousands per instance,
>>>> making my CSV ‘summary’ file run into gigabytes! Has anyone seen the
>>>> following type of string before, happen to know what might be causing it,
>>>> and can offer a fix?
>>>> ‚Äö√Ñ√∂‚àö√ë‚àö‚àÇ‚Äö√†√∂‚àö√´‚Äö√†√∂‚Äö√†√á‚Äö√Ñ√∂‚àö‚Ć‚àö‚àÇ‚Äö√†√∂‚àö¬¥‚Äö√Ñ√∂‚àö‚Ć‚àö‚àÇ‚Äö√Ñ√∂‚àö‚Ć‚àö√°
>>>> I’ve tried deliberately setting UTF-8 on the extracted strings, with put
>>>> textEncode(tString, "UTF-8") into tString. Currently I’m not attempting to
>>>> force any text format on the local HTML documents.
>>>> Thanks & regards,
>>>> Keith

_______________________________________________
use-livecode mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
