Re: Spurious characters from html files - text encoding issues?

2021-05-31 Thread Paul Dupuis via use-livecode

Thanks for posting these.

The latter one (https://quality.livecode.com/show_bug.cgi?id=12205) I was 
already following, because I think I raised the issue originally and Mark 
kindly added a bug entry. The former I was unaware of, but it would also be 
a convenient enhancement - especially along with a built-in 
'guessEncoding' function.


On 5/31/2021 8:39 AM, Ben Rubinstein via use-livecode wrote:

Also relevant enhancement requests:
https://quality.livecode.com/show_bug.cgi?id=13581
https://quality.livecode.com/show_bug.cgi?id=12205

On 21/05/2021 15:57, Paul Dupuis via use-livecode wrote:
BBEdit has a built-in "guess encoding" function to try to determine 
the encoding of a text file.

I have had this bug filed with LC for 6 years now: 
https://quality.livecode.com/show_bug.cgi?id=14474

Even Fraser, who did much of the Unicode work for LC 7, agreed there 
should be a guessEncoding function in LiveCode. Instead, anyone who 
needs one has to either write their own or get one from someone who 
has already written one.

While you can never determine the encoding of every text file with 
100% accuracy, there are algorithms that make pretty good guesses. I'd 
still like to see it as a built-in function in the LC engine.



On 5/21/2021 8:19 AM, Keith Clarke via use-livecode wrote:

Hi Ben,
Thanks for the further details and tips - my problem is now solved!

The BBEdit tip re the file 'open as UTF-8' option was a great help. I’d not 
noticed these options before (as I tend to open files from 
PathFinder folder lists rather than via apps). However, this did indeed 
reveal format errors in these cache files when they were saved with 
the raw (UTF-8 confirmed) htmlText of widget “browser”. Text 
encoding to UTF-8 before saving fixed this issue, and re-crawling the 
source pages has resulted in files that BBEdit recognises as 
‘regular’ UTF-8.

This reduced the anomaly count, but whilst testing I also noticed 
that the read-write cycle updating the output CSV file was spawning 
anomalies and expanding those already present. So I wrapped this 
function to also force UTF-8 decoding/encoding - and now all is good.


No longer will I assume that a simple text file is a simple text 
file! :-)


Thanks & regards,
Keith

On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode 
 wrote:


Hi Keith,

This might need input from the mothership, but I think if you've 
obtained the text from the browser widget's htmlText, it will 
probably be in the special 'internal' format. I'm not entirely sure 
what happens when you save that as text - I suspect it depends on 
the platform.


So for clarity (if you have the opportunity to re-save this 
material; and if it won't confuse things because existing files are 
in one format, and new ones another) it would probably be best to 
textEncode it into UTF-8, then save it as binfile. That way the 
files on disk should be UTF-8, which is something like a standard.


What I tend to do in this situation where I have text files and I'm 
not sure what the format is (and I spend quite a lot of time 
messing with text files from various sources, some unknown and many 
not under my control) is use a good text editor - I use BBedit on 
Mac, not sure what suitable alternatives would be on Windows or 
Linux - to investigate the file. BBEdit makes a guess when it opens 
the file, but allows you to try re-opening in different encodings, 
and then warns you if there are byte sequences that don't make 
sense with that encoding. So by doing this I can often figure out 
what the encoding of the file is - once you've got that, you're off 
to the races.


But if you have the opportunity to re-collect the whole set, then I 
*think* the above formula of textEncoding from LC's internal format 
to UTF-8, then saving as binary file; and reversing the process 
when you load them back in to process; and then doing the same 
again - possibly to a different format - when you output the CSV, 
should see you clear.


HTH,

Ben


On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
Thanks Ben, that’s really interesting. It never occurred to me 
that these html files might be anything other than simple plain 
text files, as I’d worked with in Coda, etc., for years.
The local HTML files store the HTML text pulled from the 
LiveCode browser widget, saved using the URL ‘file:’ option. I’d 
been working ‘live’ from the Browser widget’s html text until 
recently, when I introduced these local files to split the page 
‘crawling’ and analysis activities without needing a database.
Reading the files back into LiveCode with the URL ‘file:’ option 
works quite happily, with no text anomalies when the text is put into 
a field to read. The problem seems to arise when I load the HTML text 
into a variable and then start to extract elements using LiveCode's 
text chunking. For example, pulling the text between the offsets of, 
say,  &  tags is when these character anomalies started to pop into 
the strings.
A quick test on reading in the local HTML files with the URL ‘binfile:’ 
option and then textDecode(tString, “UTF-8”) seems to reduce the 
frequency and size of anomalies, but some remain. So, I’ll see if 
re-crawling pages and saving the HTML text from the browser widget as 
binfiles reduces this further.

Re: Spurious characters from html files - text encoding issues?

2021-05-31 Thread Ben Rubinstein via use-livecode

Also relevant enhancement requests:
https://quality.livecode.com/show_bug.cgi?id=13581
https://quality.livecode.com/show_bug.cgi?id=12205

On 21/05/2021 15:57, Paul Dupuis via use-livecode wrote:
BBEdit has a built-in "guess encoding" function to try to determine the 
encoding of a text file.

I have had this bug filed with LC for 6 years now: 
https://quality.livecode.com/show_bug.cgi?id=14474

Even Fraser, who did much of the Unicode work for LC 7, agreed there should be 
a guessEncoding function in LiveCode. Instead, anyone who needs one has to 
either write their own or get one from someone who has already written one.

While you can never determine the encoding of every text file with 100% 
accuracy, there are algorithms that make pretty good guesses. I'd still like 
to see it as a built-in function in the LC engine.



On 5/21/2021 8:19 AM, Keith Clarke via use-livecode wrote:

Hi Ben,
Thanks for the further details and tips - my problem is now solved!

The BBEdit tip re the file 'open as UTF-8' option was a great help. I’d not 
noticed these options before (as I tend to open files from PathFinder folder 
lists rather than via apps). However, this did indeed reveal format errors in 
these cache files when they were saved with the raw (UTF-8 confirmed) htmlText 
of widget “browser”. Text encoding to UTF-8 before saving fixed this issue, 
and re-crawling the source pages has resulted in files that BBEdit recognises 
as ‘regular’ UTF-8.

This reduced the anomaly count, but whilst testing I also noticed that the 
read-write cycle updating the output CSV file was spawning anomalies and 
expanding those already present. So I wrapped this function to also force 
UTF-8 decoding/encoding - and now all is good.


No longer will I assume that a simple text file is a simple text file! :-)

Thanks & regards,
Keith

On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode 
 wrote:


Hi Keith,

This might need input from the mothership, but I think if you've obtained 
the text from the browser widget's htmlText, it will probably be in the 
special 'internal' format. I'm not entirely sure what happens when you save 
that as text - I suspect it depends on the platform.


So for clarity (if you have the opportunity to re-save this material; and 
if it won't confuse things because existing files are in one format, and 
new ones another) it would probably be best to textEncode it into UTF-8, 
then save it as binfile. That way the files on disk should be UTF-8, which 
is something like a standard.


What I tend to do in this situation where I have text files and I'm not 
sure what the format is (and I spend quite a lot of time messing with text 
files from various sources, some unknown and many not under my control) is 
use a good text editor - I use BBedit on Mac, not sure what suitable 
alternatives would be on Windows or Linux - to investigate the file. BBEdit 
makes a guess when it opens the file, but allows you to try re-opening in 
different encodings, and then warns you if there are byte sequences that 
don't make sense with that encoding. So by doing this I can often figure 
out what the encoding of the file is - once you've got that, you're off to 
the races.


But if you have the opportunity to re-collect the whole set, then I *think* 
the above formula of textEncoding from LC's internal format to UTF-8, then 
saving as binary file; and reversing the process when you load them back in 
to process; and then doing the same again - possibly to a different format 
- when you output the CSV, should see you clear.


HTH,

Ben


On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
Thanks Ben, that’s really interesting. It never occurred to me that these 
html files might be anything other than simple plain text files, as I’d 
worked with in Coda, etc., for years.
The local HTML files store the HTML text pulled from the LiveCode browser 
widget, saved using the URL ‘file:’ option. I’d been working ‘live’ from the 
Browser widget’s html text until recently, when I introduced these local 
files to split the page ‘crawling’ and analysis activities without needing 
a database.
Reading the files back into LiveCode with the URL ‘file:’ option works 
quite happily, with no text anomalies when the text is put into a field to 
read. The problem seems to arise when I load the HTML text into a variable 
and then start to extract elements using LiveCode's text chunking. For 
example, pulling the text between the offsets of, say,  &  tags is when 
these character anomalies started to pop into the strings.
A quick test on reading in the local HTML files with the URL ‘binfile:’ 
option and then textDecode(tString, “UTF-8”) seems to reduce the frequency 
and size of anomalies, but some remain. So, I’ll see if re-crawling pages 
and saving the HTML text from the browser widget as binfiles reduces this 
further.

Thanks & regards,
Keith
On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
 wrote:


Hi Keith,

The thing with character encoding is that you always need to know where 
it's coming from and where it's going.

Re: Spurious characters from html files - text encoding issues?

2021-05-21 Thread Paul Dupuis via use-livecode
BBEdit has a built-in "guess encoding" function to try to determine the 
encoding of a text file.

I have had this bug filed with LC for 6 years now: 
https://quality.livecode.com/show_bug.cgi?id=14474

Even Fraser, who did much of the Unicode work for LC 7, agreed there 
should be a guessEncoding function in LiveCode. Instead, anyone who 
needs one has to either write their own or get one from someone who has 
already written one.

While you can never determine the encoding of every text file with 100% 
accuracy, there are algorithms that make pretty good guesses. I'd still 
like to see it as a built-in function in the LC engine.
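
For illustration, here is a minimal sketch of the kind of heuristic such a 
function might use - a BOM check followed by a UTF-8 round-trip test. It 
assumes LC 7+ with textDecode/textEncode; the handler name and the CP1252 
fallback are placeholders, not a real built-in:

   function guessEncoding pData
      -- compare bytes exactly, not case-insensitively
      set the caseSensitive to true
      -- check for a byte-order mark first
      if byte 1 to 3 of pData is numToByte(239) & numToByte(187) & numToByte(191) then
         return "UTF-8"
      end if
      if byte 1 to 2 of pData is numToByte(255) & numToByte(254) then
         return "UTF-16LE"
      end if
      if byte 1 to 2 of pData is numToByte(254) & numToByte(255) then
         return "UTF-16BE"
      end if
      -- no BOM: if the data survives a UTF-8 round trip, UTF-8 is a good bet
      if textEncode(textDecode(pData, "UTF-8"), "UTF-8") is pData then
         return "UTF-8"
      end if
      -- otherwise assume a single-byte legacy encoding (placeholder choice)
      return "CP1252"
   end guessEncoding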



On 5/21/2021 8:19 AM, Keith Clarke via use-livecode wrote:

Hi Ben,
Thanks for the further details and tips - my problem is now solved!

The BBEdit tip re the file 'open as UTF-8' option was a great help. I’d not 
noticed these options before (as I tend to open files from PathFinder folder 
lists rather than via apps). However, this did indeed reveal format errors in 
these cache files when they were saved with the raw (UTF-8 confirmed) htmlText 
of widget “browser”. Text encoding to UTF-8 before saving fixed this issue, and 
re-crawling the source pages has resulted in files that BBEdit recognises as 
‘regular’ UTF-8.

This reduced the anomaly count, but whilst testing I also noticed that the 
read-write cycle updating the output CSV file was spawning anomalies and 
expanding those already present. So I wrapped this function to also force 
UTF-8 decoding/encoding - and now all is good.

No longer will I assume that a simple text file is a simple text file! :-)

Thanks & regards,
Keith


On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode 
 wrote:

Hi Keith,

This might need input from the mothership, but I think if you've obtained the 
text from the browser widget's htmlText, it will probably be in the special 
'internal' format. I'm not entirely sure what happens when you save that as 
text - I suspect it depends on the platform.

So for clarity (if you have the opportunity to re-save this material; and if it 
won't confuse things because existing files are in one format, and new ones 
another) it would probably be best to textEncode it into UTF-8, then save it as 
binfile. That way the files on disk should be UTF-8, which is something like a 
standard.

What I tend to do in this situation where I have text files and I'm not sure 
what the format is (and I spend quite a lot of time messing with text files 
from various sources, some unknown and many not under my control) is use a good 
text editor - I use BBedit on Mac, not sure what suitable alternatives would be 
on Windows or Linux - to investigate the file. BBEdit makes a guess when it 
opens the file, but allows you to try re-opening in different encodings, and 
then warns you if there are byte sequences that don't make sense with that 
encoding. So by doing this I can often figure out what the encoding of the file 
is - once you've got that, you're off to the races.

But if you have the opportunity to re-collect the whole set, then I *think* the 
above formula of textEncoding from LC's internal format to UTF-8, then saving 
as binary file; and reversing the process when you load them back in to 
process; and then doing the same again - possibly to a different format - when 
you output the CSV, should see you clear.

HTH,

Ben


On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:

Thanks Ben, that’s really interesting. It never occurred to me that these html 
files might be anything other than simple plain text files, as I’d worked with 
in Coda, etc., for years.
The local HTML files store the HTML text pulled from the LiveCode browser 
widget, saved using the URL ‘file:’ option. I’d been working ‘live’ from the 
Browser widget’s html text until recently, when I introduced these local files 
to split the page ‘crawling’ and analysis activities without needing a 
database.
Reading the files back into LiveCode with the URL ‘file:’ option works quite 
happily, with no text anomalies when the text is put into a field to read. The 
problem seems to arise when I load the HTML text into a variable and then 
start to extract elements using LiveCode's text chunking. For example, pulling 
the text between the offsets of, say,  &  tags is when these character 
anomalies started to pop into the strings.
A quick test on reading in the local HTML files with the URL ‘binfile:’ option 
and then textDecode(tString, “UTF-8”) seems to reduce the frequency and size of 
anomalies, but some remain. So, I’ll see if re-crawling pages and saving the 
HTML text from the browser widget as binfiles reduces this further.
Thanks & regards,
Keith

On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
 wrote:

Hi Keith,

The thing with character encoding is that you always need to know where it's 
coming from and where it's going.

Do you know how the HTML documents were obtained? Saved from a browser, fetched 
by curl, fetched by Livecode? Or generated on disk by something else?

Re: Spurious characters from html files - text encoding issues?

2021-05-21 Thread Keith Clarke via use-livecode
Hi Ben,
Thanks for the further details and tips - my problem is now solved! 

The BBEdit tip re the file 'open as UTF-8' option was a great help. I’d not 
noticed these options before (as I tend to open files from PathFinder folder 
lists rather than via apps). However, this did indeed reveal format errors in 
these cache files when they were saved with the raw (UTF-8 confirmed) htmlText 
of widget “browser”. Text encoding to UTF-8 before saving fixed this issue, and 
re-crawling the source pages has resulted in files that BBEdit recognises as 
‘regular’ UTF-8.

This reduced the anomaly count, but whilst testing I also noticed that the 
read-write cycle updating the output CSV file was spawning anomalies and 
expanding those already present. So I wrapped this function to also force 
UTF-8 decoding/encoding - and now all is good.
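
A rough sketch of that sort of wrapper, with an illustrative handler name, 
file-path parameter and single-row append (not the actual code):

   command appendRowToCSV pRow, pPath
      local tData, tText
      -- read the existing file as raw bytes and decode from UTF-8
      if there is a file pPath then
         put URL ("binfile:" & pPath) into tData
         put textDecode(tData, "UTF-8") into tText
      end if
      -- append the new row, then encode back to UTF-8 and write as binary
      put pRow & return after tText
      put textEncode(tText, "UTF-8") into URL ("binfile:" & pPath)
   end appendRowToCSV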

No longer will I assume that a simple text file is a simple text file! :-)

Thanks & regards,
Keith 

> On 19 May 2021, at 19:01, Ben Rubinstein via use-livecode 
>  wrote:
> 
> Hi Keith,
> 
> This might need input from the mothership, but I think if you've obtained the 
> text from the browser widget's htmlText, it will probably be in the special 
> 'internal' format. I'm not entirely sure what happens when you save that as 
> text - I suspect it depends on the platform.
> 
> So for clarity (if you have the opportunity to re-save this material; and if 
> it won't confuse things because existing files are in one format, and new 
> ones another) it would probably be best to textEncode it into UTF-8, then 
> save it as binfile. That way the files on disk should be UTF-8, which is 
> something like a standard.
> 
> What I tend to do in this situation where I have text files and I'm not sure 
> what the format is (and I spend quite a lot of time messing with text files 
> from various sources, some unknown and many not under my control) is use a 
> good text editor - I use BBedit on Mac, not sure what suitable alternatives 
> would be on Windows or Linux - to investigate the file. BBEdit makes a guess 
> when it opens the file, but allows you to try re-opening in different 
> encodings, and then warns you if there are byte sequences that don't make 
> sense with that encoding. So by doing this I can often figure out what the 
> encoding of the file is - once you've got that, you're off to the races.
> 
> But if you have the opportunity to re-collect the whole set, then I *think* 
> the above formula of textEncoding from LC's internal format to UTF-8, then 
> saving as binary file; and reversing the process when you load them back in 
> to process; and then doing the same again - possibly to a different format - 
> when you output the CSV, should see you clear.
> 
> HTH,
> 
> Ben
> 
> 
> On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:
>> Thanks Ben, that’s really interesting. It never occurred to me that these 
>> html files might be anything other than simple plain text files, as I’d 
>> worked with in Coda, etc., for years.
>> The local HTML files store the HTML text pulled from the LiveCode 
>> browser widget, saved using the URL ‘file:’ option. I’d been working ‘live’ 
>> from the Browser widget’s html text until recently, when I introduced 
>> these local files to split the page ‘crawling’ and analysis activities 
>> without needing a database.
>> Reading the files back into LiveCode with the URL ‘file:’ option works quite 
>> happily, with no text anomalies when the text is put into a field to read. 
>> The problem seems to arise when I load the HTML text into a variable and 
>> then start to extract elements using LiveCode's text chunking. For example, 
>> pulling the text between the offsets of, say,  &  tags is when these 
>> character anomalies started to pop into the strings.
>> A quick test on reading in the local HTML files with the URL ‘binfile:’ 
>> option and then textDecode(tString, “UTF-8”) seems to reduce the frequency 
>> and size of anomalies, but some remain. So, I’ll see if re-crawling pages 
>> and saving the HTML text from the browser widget as binfiles reduces this 
>> further.
>> Thanks & regards,
>> Keith
>>> On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
>>>  wrote:
>>> 
>>> Hi Keith,
>>> 
>>> The thing with character encoding is that you always need to know where 
>>> it's coming from and where it's going.
>>> 
>>> Do you know how the HTML documents were obtained? Saved from a browser, 
>>> fetched by curl, fetched by Livecode? Or generated on disk by something 
>>> else?
>>> 
>>> If it was saved from a browser or fetched by curl, then the format is most 
>>> likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to 
>>> do two things:
>>> - read it in as a binary file, rather than text (e.g. use URL 
>>> "binfile://..." or "open file ... for binary read")
>>> - convert it to the internal text format FROM UTF-8 - which means use 
>>> textDecode(tString, "UTF-8"), rather than textEncode
>>> 
>>> If it was fetched by LiveCode, then it most likely arrived over the wire as 
>>> UTF-8, but if it was saved by LiveCode as text (not binary) then it _may_ 
>>> have got corrupted.

Re: Spurious characters from html files - text encoding issues?

2021-05-19 Thread Ben Rubinstein via use-livecode

Hi Keith,

This might need input from the mothership, but I think if you've obtained the 
text from the browser widget's htmlText, it will probably be in the special 
'internal' format. I'm not entirely sure what happens when you save that as 
text - I suspect it depends on the platform.


So for clarity (if you have the opportunity to re-save this material; and if 
it won't confuse things because existing files are in one format, and new ones 
another) it would probably be best to textEncode it into UTF-8, then save it 
as binfile. That way the files on disk should be UTF-8, which is something 
like a standard.
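
In LiveCode terms, a sketch of that save-and-reload round trip (tPath is just 
an illustrative variable; the widget name "browser" comes from Keith's message):

   -- save: encode the widget's htmlText as UTF-8 and write the raw bytes
   put the htmlText of widget "browser" into tHtml
   put textEncode(tHtml, "UTF-8") into URL ("binfile:" & tPath)

   -- reload: read the raw bytes and decode back to LC's internal text format
   put URL ("binfile:" & tPath) into tData
   put textDecode(tData, "UTF-8") into tHtml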


What I tend to do in this situation where I have text files and I'm not sure 
what the format is (and I spend quite a lot of time messing with text files 
from various sources, some unknown and many not under my control) is use a 
good text editor - I use BBedit on Mac, not sure what suitable alternatives 
would be on Windows or Linux - to investigate the file. BBEdit makes a guess 
when it opens the file, but allows you to try re-opening in different 
encodings, and then warns you if there are byte sequences that don't make 
sense with that encoding. So by doing this I can often figure out what the 
encoding of the file is - once you've got that, you're off to the races.


But if you have the opportunity to re-collect the whole set, then I *think* 
the above formula of textEncoding from LC's internal format to UTF-8, then 
saving as binary file; and reversing the process when you load them back in to 
process; and then doing the same again - possibly to a different format - when 
you output the CSV, should see you clear.


HTH,

Ben


On 17/05/2021 15:58, Keith Clarke via use-livecode wrote:

Thanks Ben, that’s really interesting. It never occurred to me that these html 
files might be anything other than simple plain text files, as I’d worked with 
in Coda, etc., for years.

The local HTML files store the HTML text pulled from the LiveCode browser 
widget, saved using the URL ‘file:’ option. I’d been working ‘live’ from the 
Browser widget’s html text until recently, when I introduced these local files 
to split the page ‘crawling’ and analysis activities without needing a 
database.

Reading the files back into LiveCode with the URL ‘file:’ option works quite 
happily, with no text anomalies when the text is put into a field to read. The 
problem seems to arise when I load the HTML text into a variable and then start 
to extract elements using LiveCode's text chunking. For example, pulling the 
text between the offsets of, say,  &  tags is when these character anomalies 
started to pop into the strings.

A quick test on reading in the local HTML files with the URL ‘binfile:’ option 
and then textDecode(tString, “UTF-8”) seems to reduce the frequency and size of 
anomalies, but some remain. So, I’ll see if re-crawling pages and saving the 
HTML text from the browser widget as binfiles reduces this further.
Thanks & regards,
Keith


On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
 wrote:

Hi Keith,

The thing with character encoding is that you always need to know where it's 
coming from and where it's going.

Do you know how the HTML documents were obtained? Saved from a browser, fetched 
by curl, fetched by Livecode? Or generated on disk by something else?

If it was saved from a browser or fetched by curl, then the format is most 
likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to do 
two things:
- read it in as a binary file, rather than text (e.g. use URL "binfile://..." or 
"open file ... for binary read")
- convert it to the internal text format FROM UTF-8 - which means use 
textDecode(tString, "UTF-8"), rather than textEncode

If it was fetched by LiveCode, then it most likely arrived over the wire as 
UTF-8, but if it was saved by LiveCode as text (not binary) then it _may_ have 
got corrupted.

If you can see the text looking as you expect in LiveCode, you've solved half the 
problem. Then you need to consider where it's going: who (or what) is going to consume the 
CSV. This is the time to use textEncode, and then be sure to save it as a binary file. If 
the consumer will be something reasonably modern, then again UTF-8 is a good default. If 
it's something much older, you might need to use "CP1252" or similar.

HTH,

Ben


On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:

Hi folks,
I’m using LiveCode to summarise text from HTML documents into csv summary files and 
am noticing that when I extract strings from html documents stored on disk - rather 
than visiting the sites via the browser widget & grabbing the HTML text - weird 
characters are being inserted in place of what appear to be ‘regular’ characters.
The number of characters inserted can run into the thousands per instance, 
making my csv ‘summary’ file run into gigabytes! Has anyone seen the following 
type of string before? Do you happen to know what might be causing it, and can 
you offer a fix?

Re: Spurious characters from html files - text encoding issues?

2021-05-17 Thread Keith Clarke via use-livecode
Thanks Ben, that’s really interesting. It never occurred to me that these html 
files might be anything other than simple plain text files, as I’d worked with 
in Coda, etc., for years.

The local HTML files store the HTML text pulled from the LiveCode browser 
widget, saved using the URL ‘file:’ option. I’d been working ‘live’ from the 
Browser widget’s html text until recently, when I introduced these local files 
to split the page ‘crawling’ and analysis activities without needing a 
database.

Reading the files back into LiveCode with the URL ‘file:’ option works quite 
happily, with no text anomalies when the text is put into a field to read. The 
problem seems to arise when I load the HTML text into a variable and then start 
to extract elements using LiveCode's text chunking. For example, pulling the 
text between the offsets of, say,  &  tags is when these character anomalies 
started to pop into the strings.

A quick test on reading in the local HTML files with the URL ‘binfile:’ option 
and then textDecode(tString, “UTF-8”) seems to reduce the frequency and size of 
anomalies, but some remain. So, I’ll see if re-crawling pages and saving the 
HTML text from the browser widget as binfiles reduces this further.
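
For illustration, a decode-then-chunk sketch of that flow; the "title" tag and 
the tPath variable are stand-ins only, since the actual tags didn’t survive the 
list archive:

   put URL ("binfile:" & tPath) into tData
   put textDecode(tData, "UTF-8") into tHtml
   -- extract the text between an opening and closing tag
   put offset("<title>", tHtml) into tStart
   put offset("</title>", tHtml) into tEnd
   if tStart > 0 and tEnd > tStart then
      put char (tStart + length("<title>")) to (tEnd - 1) of tHtml into tExtract
   end if
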
Thanks & regards,
Keith  

> On 17 May 2021, at 12:57, Ben Rubinstein via use-livecode 
>  wrote:
> 
> Hi Keith,
> 
> The thing with character encoding is that you always need to know where it's 
> coming from and where it's going.
> 
> Do you know how the HTML documents were obtained? Saved from a browser, 
> fetched by curl, fetched by Livecode? Or generated on disk by something else?
> 
> If it was saved from a browser or fetched by curl, then the format is most 
> likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to 
> do two things:
>   - read it in as a binary file, rather than text (e.g. use URL 
> "binfile://..." or "open file ... for binary read")
>   - convert it to the internal text format FROM UTF-8 - which means use 
> textDecode(tString, "UTF-8"), rather than textEncode
> 
> If it was fetched by LiveCode, then it most likely arrived over the wire as 
> UTF-8, but if it was saved by LiveCode as text (not binary) then it _may_ 
> have got corrupted.
> 
> If you can see the text looking as you expect in LiveCode, you've solved half 
> the problem. Then you need to consider where it's going: who (or what) is going 
> to consume the CSV. This is the time to use textEncode, and then be sure to 
> save it as a binary file. If the consumer will be something reasonably 
> modern, then again UTF-8 is a good default. If it's something much older, you 
> might need to use "CP1252" or similar.
> 
> HTH,
> 
> Ben
> 
> 
> On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:
>> Hi folks,
>> I’m using LiveCode to summarise text from HTML documents into csv summary 
>> files and am noticing that when I extract strings from html documents stored 
>> on disk - rather than visiting the sites via the browser widget & grabbing 
>> the HTML text - weird characters are being inserted in place of what appear 
>> to be ‘regular’ characters.
>> The number of characters inserted can run into the thousands per instance, 
>> making my csv ‘summary’ file run into gigabytes! Has anyone seen the 
>> following type of string before? Do you happen to know what might be causing 
>> it, and can you offer a fix?
>> ‚Äö
>> I’ve tried deliberately setting UTF-8 on the extracted strings, with put 
>> textEncode(tString, "UTF-8") into tString. Currently I’m not attempting to 
>> force any text format on the local HTML documents.
>> Thanks & regards,
>> Keith


Re: Spurious characters from html files - text encoding issues?

2021-05-17 Thread Ben Rubinstein via use-livecode

Hi Keith,

The thing with character encoding is that you always need to know where it's 
coming from and where it's going.


Do you know how the HTML documents were obtained? Saved from a browser, 
fetched by curl, fetched by Livecode? Or generated on disk by something else?


If it was saved from a browser or fetched by curl, then the format is most 
likely to be UTF-8. In order to see it correctly in LiveCode, you'd need to do 
two things:
	- read it in as a binary file, rather than text (e.g. use URL "binfile://..." 
or "open file ... for binary read")
	- convert it to the internal text format FROM UTF-8 - which means use 
textDecode(tString, "UTF-8"), rather than textEncode


If it was fetched by LiveCode, then it most likely arrived over the wire as 
UTF-8, but if it was saved by LiveCode as text (not binary) then it _may_ have 
got corrupted.


If you can see the text looking as you expect in LiveCode, you've solved half 
the problem. Then you need to consider where it's going: who (or what) is going 
to consume the CSV. This is the time to use textEncode, and then be sure to 
save it as a binary file. If the consumer will be something reasonably modern, 
then again UTF-8 is a good default. If it's something much older, you might 
need to use "CP1252" or similar.


HTH,

Ben


On 17/05/2021 09:28, Keith Clarke via use-livecode wrote:

Hi folks,
I’m using LiveCode to summarise text from HTML documents into csv summary files and 
am noticing that when I extract strings from html documents stored on disk - rather 
than visiting the sites via the browser widget & grabbing the HTML text - weird 
characters are being inserted in place of what appear to be ‘regular’ characters.

The number of characters inserted can run into the thousands per instance, 
making my csv ‘summary’ file run into gigabytes! Has anyone seen the following 
type of string before? Do you happen to know what might be causing it, and can 
you offer a fix?
‚Äö

I’ve tried deliberately setting UTF-8 on the extracted strings, with put 
textEncode(tString, "UTF-8") into tString. Currently I’m not attempting to 
force any text format on the local HTML documents.

Thanks & regards,
Keith


Spurious characters from html files - text encoding issues?

2021-05-17 Thread Keith Clarke via use-livecode
Hi folks,
I’m using LiveCode to summarise text from HTML documents into csv summary files 
and am noticing that when I extract strings from html documents stored on disk 
- rather than visiting the sites via the browser widget & grabbing the HTML 
text - weird characters are being inserted in place of what appear to be 
‘regular’ characters.

The number of characters inserted can run into the thousands per instance, 
making my csv ‘summary’ file run into gigabytes! Has anyone seen the following 
type of string before? Do you happen to know what might be causing it, and can 
you offer a fix? 
‚Äö

I’ve tried deliberately setting UTF-8 on the extracted strings, with put 
textEncode(tString, "UTF-8") into tString. Currently I’m not attempting to 
force any text format on the local HTML documents.

Thanks & regards,
Keith 