Re: byteLen()?

Mark Waddingham via use-livecode Fri, 10 Mar 2017 01:47:48 -0800

On 2017-03-09 22:24, Richard Gaskin via use-livecode wrote:

I'm not sure I follow that, but it almost sounds like no matter what
the encoding each char is mapped to one byte, so a 5-char string like
"hello" will take up 5 bytes - is that right?


In the case of the implicit conversion the engine does between text
and binary data - yes it is. The number of bytes in the generated
data will be the same as the number of chars in the original text.

However that only relates to the implicit 'compatibility' conversion
the engine does. In new code, it is better to make sure the conversion
is explicit by using textEncode / textDecode.

I have some large files I want to open and read as binary (for speed
mostly; if there's a reason I should be doing that as text let me
know), then I'll work my way through it looking for substrings,
keeping track of the byte offsets within the data where those can be
found.

Once I have my list of byte offsets, I can save that as a sort of
index file, and use "seek" or "read at" to go directly to that portion
of the larger files whenever I need to access that data.

The data files may use a variety of encodings, mostly UTF-8 but I can
expect Latin-ISO or perhaps even UTF-16.  In short, encoding may not
be known in advance.

But since I'm working with binary data the whole time, the encoding
shouldn't matter, should it?

It depends on whether you need to convert a text string into a byte
sequence to search for, and whether you are wanting an exact text match
or a caseless text match.

If the file you are searching is just a text file which you want to
search as binary then you need to know the encoding of said text file so
you can encode the text you are searching for in the same way. For
example, if you are searching for "foó" and encode it as UTF-16 (which
would generate 6 bytes) and the (text) file you are searching is UTF-8
encoded then it won't work.

The UTF-8 encoding of "foó" is different from the UTF-16 encoding.
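
To make the mismatch concrete, here is a minimal sketch in Python (not
LiveCode, used purely for illustration). The helper name find_all and the
sample data are hypothetical, and the offsets are 0-based as is usual in
Python.

```python
def find_all(haystack, needle):
    """Return every 0-based byte offset at which needle occurs in haystack."""
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets

# Hypothetical file contents, stored as UTF-8.
data = "bar foó baz foó".encode("utf-8")

needle_utf8 = "foó".encode("utf-8")       # 4 bytes: ó takes two bytes in UTF-8
needle_utf16 = "foó".encode("utf-16-le")  # 6 bytes: two bytes per code unit

hits_utf8 = find_all(data, needle_utf8)    # needle encoded like the file
hits_utf16 = find_all(data, needle_utf16)  # needle encoded differently
```

With the needle encoded the same way as the file, both occurrences are
found; with the UTF-16 needle the search finds nothing, even though the
text is the same.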

If the file you are searching is some binary file containing text then
things are decidedly more tricky as to do the search accurately you need
to know the exact format of the binary file so you know precisely where
the (encoded) text strings within it sit. This is presuming you are not
happy with 'false positives'.

(A stackfile, for example, contains encoded text and sequences of bytes
which never were and never will be text - however, it is perfectly
possible for the latter to match encoded text, just by chance.)

If you are wanting a caseless match rather than an exact match then you
pretty much have to treat the file as text - you can't do caseless
matching on arbitrary bytes as it makes no sense (as they are just bytes
with no meaning).
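
As a rough illustration of that difference, the following Python sketch
(again not LiveCode; the sample data is hypothetical) compares an exact
byte search with a caseless search done on decoded text.

```python
# Hypothetical file contents, stored as UTF-8.
data = "Say FOÓ or foó".encode("utf-8")

# An exact byte search only sees the lower-case spelling.
exact_hits = data.count("foó".encode("utf-8"))

# A caseless search needs text: decode first, then compare case-folded forms.
text = data.decode("utf-8")
caseless_hits = text.casefold().count("foó".casefold())
```

Note that positions found in the decoded (and case-folded) text are
character positions, not byte offsets into the file, so they cannot be
used directly for seeking.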

Earlier you wrote:

  the number of bytes in textEncode(tText, kEncoding)

...which implies that I would need to know the encoding (kEncoding),
but do I really need textEncode for the use-case described here?


Strictly speaking that depends on the encoding:

For native encoding - number of bytes == number of codeunits

For UTF-16 - number of bytes = 2 * number of codeunits

For UTF-32 - number of bytes = 4 * number of codepoints

However, UTF-8 is a multibyte encoding based on the codepoints in the
text. A single codepoint can be encoded as 1, 2, 3 or 4 bytes.

The point here being, in order to compute the byte length of a piece of
text encoded as UTF-8 you need to look at each character. Since
textEncode does that, it is a reasonably clear way of working such
things out.
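
The same idea, encode first and then count the bytes, can be sketched in
Python (not LiveCode; the helper name byte_len is hypothetical). Latin-1
stands in here for a single-byte native encoding, which is an assumption,
and the "-le" variants are used so that no byte order mark is added.

```python
def byte_len(text, encoding):
    """Byte length of text once encoded, found by actually encoding it."""
    return len(text.encode(encoding))

sample = "foó"
encodings = ("latin-1", "utf-16-le", "utf-32-le", "utf-8")
lengths = {enc: byte_len(sample, enc) for enc in encodings}

# UTF-8 width depends on the codepoint: 1, 2, 3 or 4 bytes.
# The characters are a, ó (U+00F3), € (U+20AC) and U+1F600.
utf8_widths = [byte_len(ch, "utf-8") for ch in "a\u00f3\u20ac\U0001F600"]
```

For the fixed-width encodings the length could be derived by
multiplication; for UTF-8 the only general way is to look at each
codepoint, which encoding the text does anyway.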

By the way, here I've mentioned three things - codeunit, codepoint and
char:

  - a codeunit is the smallest element in UTF-16 and represents unicode
    codepoints 0-65535 (i.e. fits in a 16-bit unsigned int).

  - a codepoint is the natural 'unit' of Unicode - a 21-bit quantity
    which indexes into the Unicode char tables. (UTF-16 encodes the
    21-bit quantity by using 'surrogate' pairs of codeunits - meaning
    that, in that encoding a codepoint can take 1 or 2 codeunits).

  - a char is a sequence of codepoints which are generally considered to
    be a single (human-processable) character.
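
The three terms can be told apart with a small Python sketch (not
LiveCode; the helper names are hypothetical). Python strings are indexed
by codepoint, so len() counts codepoints; UTF-16 codeunits are counted
here by encoding and dividing the byte length by two.

```python
def utf16_code_units(text):
    """Number of 16-bit codeunits needed to hold text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

def code_points(text):
    """Number of Unicode codepoints in text."""
    return len(text)

bmp = "\u00f3"         # ó: one codepoint, one codeunit
astral = "\U0001F600"  # beyond 65535: one codepoint, a surrogate pair
combined = "e\u0301"   # e + combining acute accent: two codepoints

counts = {
    "bmp": (code_points(bmp), utf16_code_units(bmp)),
    "astral": (code_points(astral), utf16_code_units(astral)),
    "combined": (code_points(combined), utf16_code_units(combined)),
}
```

The last sample would normally be seen by a reader as a single char even
though it is two codepoints; counting chars in that sense needs Unicode
segmentation rules, which this sketch does not attempt.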

I'm not sure if the above helps or not - it might be helpful to explain
the problem you are trying to solve more deeply. I still can't quite see
how the byte length of a piece of text (encoded in a particular encoding)
is useful since surely you need the byte sequence to search for anyway,
in which case the number of bytes is the length of that byte sequence
that you already have...


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
