Re: Decoding "quoted-printable" -- Help needed -- Reopened - Solved 2nd
I am very sorry that I am overstressing this list. I keep on answering my own questions.

The function needs to address bytes. I found this by looking at some similar C# code:

# Code snippet from C#
# Source: https://stackoverflow.com/questions/32083334/consecutive-control-characters-in-quoted-printable-not-decoding-correctly
---
string sHex = input;
sHex = sHex.Substring(i + 1, 2);
int hex = Convert.ToInt32(sHex, 16);
byte b = Convert.ToByte(hex);
output.Add(b);
i += 3;
---

I overlooked that the value must be a byte value. Anyway, that is all new to me.

So, the correct and tested conversion to and from "quoted-printable" with encoded UTF-8 in LiveCode >7 is:

---
local tChar
local tItem
local tCodedChar
local tCodePoint
local tEncoded
local tDecoded

set the itemdelimiter to "="

// ENCODE EXAMPLE
put "€" into tChar
put textEncode ( tChar , "UTF-8" ) into tCodedChar
repeat for each codePoint tCodePoint in tCodedChar
   put "="& baseConvert ( byteToNum ( tCodePoint ) , 10 , 16 ) after tEncoded
end repeat
put tEncoded into msg
--> "=E2=82=AC" - the quoted-printable UTF-8 encoding of the Euro symbol "€"

// DECODE EXAMPLE
put "=E2=82=AC" into tEncoded
delete char 1 of tEncoded
repeat for each item tItem in tEncoded
   put numToByte ( baseConvert ( tItem , 16 , 10 ) ) after tDecoded
end repeat
put textDecode ( tDecoded , "UTF-8" ) into msg
--> the Euro symbol "€"
---

Thanks to all. Given a bit of time, I will post a solution for UTF-8 quoted-printable encoded e-mail blocks of text in the Forum.

Roland

---

On Thu, 14 Nov 2019 at 20:41, R.H. wrote:
>
> Oh, sorry, I was too quick declaring a solution.
>
> Even though the code of the function works fine, and the result also converts back, the "quoted-printable" or "UTF-8" code expects that each codepoint is encoded in Hex with just two ASCII letters representing a codepoint.
>
> For example, for the Euro symbol "€" we have three codepoints.
> The function below converts to "=E2=201A=AC" while it must be "=E2=82=AC".
> The "=" sign is just a delimiter in quoted-printable.
>
> Now, I do not know what is wrong in my thinking as I am not getting quite the same results.
> (The result is ok for other symbols such as 'ü'.)
>
> EXAMPLE:
>
> put "€" into tChar
> // First encode to UTF-8:
> put textEncode(tChar,"UTF-8") into tCodedChar
> // Repeat for each codepoint in the UTF-8 char
> repeat for each codePoint tCodePoint in tCodedChar
>    // Encode each codepoint to its integer expression and convert to Hex value:
>    put "="& BaseConvert ( codePointToNum (tCodePoint) , 10 , 16 ) after tEncoded
> end repeat
> put tEncoded into field "Show Codepoints" -- Expected ASCII representing Hex numbers
> -- Result: "=E2=201A=AC" -- Instead of "=E2=82=AC" , but valid and working.
>
> The actual "correct" UTF-8 result can be tested here: http://www.endmemo.com/unicode/unicodeconverter.php
>
> What am I missing?
>
> Thanks a lot
> Roland

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
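For anyone who wants to cross-check the byte-wise round trip outside LiveCode, here is a minimal Python sketch of the same idea. The function names qp_encode_all and qp_decode_all are made up for this illustration. Like the LiveCode example, the decoder assumes its input consists solely of "=XX" escapes (no literal characters, no soft line breaks). The encoder pads every byte to two hex digits, which matters for byte values below 16.

```python
def qp_encode_all(text: str) -> str:
    """Hypothetical helper: write every UTF-8 byte of text as '=XX'."""
    # Iterating over a bytes object yields integers 0..255, one per byte.
    return "".join(f"={b:02X}" for b in text.encode("utf-8"))


def qp_decode_all(encoded: str) -> str:
    """Hypothetical helper: inverse of qp_encode_all.

    Assumes the input is nothing but '=XX' escapes.
    """
    # Splitting on "=" leaves an empty first item, just like the
    # 'delete char 1' step in the item-delimiter approach.
    items = encoded.split("=")[1:]
    raw = bytes(int(item, 16) for item in items)
    return raw.decode("utf-8")


euro = "\u20ac"  # the Euro symbol
encoded = qp_encode_all(euro)
decoded = qp_decode_all(encoded)
print(encoded)  # =E2=82=AC
print(decoded == euro)  # True
```

The two-digit padding is a deliberate choice in this sketch: a line feed, for example, comes out as "=0A" rather than "=A".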
Re: Decoding "quoted-printable" -- Help needed -- Reopened
Oh, sorry, I was too quick declaring a solution.

Even though the code of the function works fine, and the result also converts back, the "quoted-printable" or "UTF-8" code expects that each codepoint is encoded in Hex with just two ASCII letters representing a codepoint.

For example, for the Euro symbol "€" we have three codepoints.
The function below converts to "=E2=201A=AC" while it must be "=E2=82=AC".
The "=" sign is just a delimiter in quoted-printable.

Now, I do not know what is wrong in my thinking as I am not getting quite the same results.
(The result is ok for other symbols such as 'ü'.)

EXAMPLE:

put "€" into tChar
// First encode to UTF-8:
put textEncode(tChar,"UTF-8") into tCodedChar
// Repeat for each codepoint in the UTF-8 char
repeat for each codePoint tCodePoint in tCodedChar
   // Encode each codepoint to its integer expression and convert to Hex value:
   put "="& BaseConvert ( codePointToNum (tCodePoint) , 10 , 16 ) after tEncoded
end repeat
put tEncoded into field "Show Codepoints" -- Expected ASCII representing Hex numbers
-- Result: "=E2=201A=AC" -- Instead of "=E2=82=AC" , but valid and working.

The actual "correct" UTF-8 result can be tested here: http://www.endmemo.com/unicode/unicodeconverter.php

What am I missing?

Thanks a lot
Roland
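A likely explanation for the stray "201A" can be sketched in Python. If the UTF-8 bytes are walked as characters instead of bytes, each byte is first mapped through a legacy single-byte code page, and the codepoint of the resulting character is what gets converted to hex. The sketch assumes that code page is Windows-1252; the post does not say which platform was used, so that part is a guess. The function names are made up for this illustration.

```python
def hex_via_codepoints(text: str, legacy: str = "cp1252") -> str:
    """Hypothetical helper reproducing the unexpected output.

    Treats each UTF-8 byte as a character in a legacy code page
    (assumed here to be Windows-1252) and prints its codepoint.
    """
    utf8_bytes = text.encode("utf-8")
    as_chars = utf8_bytes.decode(legacy)
    return "".join(f"={ord(ch):X}" for ch in as_chars)


def hex_via_bytes(text: str) -> str:
    """Hypothetical helper: print the raw byte values instead."""
    return "".join(f"={b:02X}" for b in text.encode("utf-8"))


euro = "\u20ac"   # the Euro symbol
u_umlaut = "\u00fc"  # the letter u with umlaut

print(hex_via_codepoints(euro))      # =E2=201A=AC
print(hex_via_bytes(euro))           # =E2=82=AC
print(hex_via_codepoints(u_umlaut))  # =C3=BC
print(hex_via_bytes(u_umlaut))       # =C3=BC
```

Under this assumption, byte 0x82 is the only one of the three that Windows-1252 maps to a character with a different number (U+201A, the low single quotation mark). Bytes 0xC3 and 0xBC map to characters with the same numbers, which would explain why 'ü' came out right.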
Decoding "quoted-printable" -- Help needed -- Solved
For those interested:

With the help of a privately received message hinting at a solution used prior to LC7, I was able to construct the required functions for LC 7 and above.

I must say that I am not really aware of all the many functions LiveCode presents; I did not even know about baseConvert() before doing a lot of research. I guess each of us must go through all the commands and functions LC provides and study them. They are difficult to find when you do not know how and what to search for. Also, I had to try to understand what codepoints are.

Here I am not using the actual quoted-printable format of codepoints in Hex presentation, each with an equal sign "=" as a prefix. That is easy to retrieve or construct using LiveCode chunk expressions. Instead I am using comma-delimited items.

// The encoding prior to LC7 according to Mark (still works even today)
-- put unidecode(uniencode("e","english"),"UTF8") into x
-- put chartonum(char 1 of x) && chartonum(char 2 of x) into y

// Encoding and decoding UTF-8 for quoted-printable chars (as they may appear in certain e-mail parts)
set the itemdelimiter to ","
put "€" into tChar -- Using the Euro symbol which is encoded with 3 codepoints (there can be up to 4 for quoted-printable).

// Encode a UTF-8 character to a quoted-printable ASCII encoding
put textEncode( tChar ,"UTF-8") into tCodedChar
repeat for each codePoint tCodePoint in tCodedChar
   put BaseConvert ( codePointToNum (tCodePoint) , 10 , 16 ) &"," after tEncoded
end repeat
delete last char of tEncoded
put tEncoded into msg -- just for testing

// Decode a quoted-printable ASCII string to UTF-8
put empty into tDecoded
repeat for each item tItem in tEncoded
   put numToCodePoint ( BaseConvert ( tItem , 16, 10 ) ) after tDecoded
end repeat
put textDecode (tDecoded , "UTF-8") after msg -- just for testing

// Result in message box
-- E2,201A,AC -- In actual quoted-printable that would be: "=E2=201A=AC" and our items must be converted accordingly
-- €

Roland
Decoding "quoted-printable" -- Help needed
Even with a lot of research and comparing functions in C# and JavaScript, I do not understand it yet.

In e-mail bodies, the content parts are often base64-encoded, no problem with that, but there is also another encoding called "quoted-printable". This is text that in my case needs to be converted to UTF-8.

Here all characters that are not pure ASCII are marked with an equal sign "=" (similar to the "%" in a URL-encoded string), and the following two characters define the byte value in Hex notation. There can be one, two, three, or even four separate byte values for a character encoded in UTF-8.

Example: "F=C3=BCr". This contains the German Umlaut and would render as the string "für". The "ü" is not part of pure ASCII and therefore it is encoded this way. This byte sequence is specific to UTF-8.

Now, as you can see, there is not just one byte represented with "=C3". There are actually two bytes, "=C3=BC", represented in Hex by "C3" and "BC", each individually converted to decimal notation as 195 and 188.

If you URL-decode the single bytes using "%" instead of "=", such as "%C3", each gives its own character, which will be "À". The URL-decoding of "%BC" gives "Ä". So this does not help. I have to somehow look at the two bytes together.

Converting pure ASCII to Hex gives the correct result in other programs:
-- Link: https://www.rapidtables.com/convert/number/ascii-to-hex.html
-- Enter: "ü"
-- Result: "C3,BC" --- what we are looking for when encoding: two separate byte representations.
-- But it only works when the character encoding is UTF-8.

How do I get from "=C3=BC" to codepoint("ü") = 252? What do I need to calculate?

How do we decode such a "quoted-printable" encoded string to UTF-8?

Thanks in advance.
Roland
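The calculation asked about above can be checked outside LiveCode with a short Python sketch. It turns "F=C3=BCr" into the bytes 70, 195, 188, 114, decodes those bytes as UTF-8, and reads off codepoint 252 for "ü". The helper name qp_hex_to_bytes is made up for this illustration; it handles only "=XX" escapes and literal ASCII, not the soft line breaks that real quoted-printable also allows. Python's standard quopri module is shown alongside for comparison.

```python
import quopri


def qp_hex_to_bytes(text: str) -> bytes:
    """Hypothetical helper: turn '=XX' escapes into raw bytes.

    Literal characters are assumed to be plain ASCII and are copied as-is.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=" and i + 2 < len(text):
            # The two characters after "=" are one byte in hex.
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


raw = qp_hex_to_bytes("F=C3=BCr")
word = raw.decode("utf-8")   # look at the bytes together, as UTF-8
print(list(raw))             # [70, 195, 188, 114]
print(ord(word[1]))          # 252

# The same number by hand: a two-byte UTF-8 sequence has the bit
# layout 110xxxxx 10yyyyyy, and the codepoint is the bits xxxxxyyyyyy.
by_hand = ((0xC3 & 0x1F) << 6) | (0xBC & 0x3F)
print(by_hand)               # 252

# Standard-library decoder, bytes in and bytes out.
print(quopri.decodestring(b"F=C3=BCr") == raw)  # True
```

So the two bytes are not converted one at a time: the marker bits are stripped from each and the remaining bits are joined, which is what a UTF-8 decoder does internally.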