On 2017-03-09 22:24, Richard Gaskin via use-livecode wrote:
I'm not sure I follow that, but it almost sounds like no matter what
the encoding each char is mapped to one byte, so a 5-char string like
"hello" will take up 5 bytes - is that right?

In the case of the implicit conversion the engine does between text
and binary data - yes it is. The number of bytes in the generated
data will be the same as the number of chars in the original text.

However that only relates to the implicit 'compatibility' conversion
the engine does. In new code, it is better to make sure the conversion
is explicit by using textEncode / textDecode.

I have some large files I want to open and read as binary (for speed
mostly; if there's a reason I should be doing that as text let me
know), then I'll work my way through it looking for substrings,
keeping track of the byte offsets within the data where those can be
found.

Once I have my list of byte offsets, I can save that as a sort of
index file, and use "seek" or "read at" to go directly to that portion
of the larger files whenever I need to access that data.

The data files may use a variety of encodings, mostly UTF-8 but I can
expect Latin-ISO or perhaps even UTF-16.  In short, encoding will may
be known in advance.

But since I'm working with binary data the whole time, the encoding
shouldn't matter, should it?

It depends on whether you need to convert a text string into a byte
sequence to search for, and whether you want an exact text match or a
caseless text match.

If the file you are searching is just a text file which you want to search as binary then you need to know the encoding of said text file so you can encode the text you are searching for in the same way. For example, if you are searching for "foó" and encode it as UTF-16 (which would generate 6 bytes) and the (text) file you are searching is UTF-8 encoded then it won't work.
The UTF-8 encoding of "foó" (4 bytes) is different from the UTF-16 encoding.
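
To make that concrete, here is a minimal sketch in Python (not LiveCode; it is only meant to show the byte arithmetic). The helper name `find_byte_offsets` and the sample data are made up for illustration. The idea is simply that the search string has to be encoded the same way as the file before you look for it in the raw bytes.

```python
import io

def find_byte_offsets(data: bytes, needle: str, encoding: str) -> list:
    """Return every byte offset where `needle`, encoded as `encoding`, occurs in `data`."""
    pattern = needle.encode(encoding)
    offsets = []
    pos = data.find(pattern)
    while pos != -1:
        offsets.append(pos)
        # Advance by one byte so overlapping occurrences are not skipped.
        pos = data.find(pattern, pos + 1)
    return offsets

# Hypothetical file contents, stored as UTF-8.
utf8_data = "xx foó yy foó".encode("utf-8")

# Needle encoded like the file: matches are found.
hits_same_encoding = find_byte_offsets(utf8_data, "foó", "utf-8")

# Needle encoded as UTF-16 (little-endian, no byte order mark): no matches.
hits_wrong_encoding = find_byte_offsets(utf8_data, "foó", "utf-16-le")

# A saved offset can be used later to jump straight to the data, much like
# "seek" / "read at" on a real file (BytesIO stands in for the file here).
stream = io.BytesIO(utf8_data)
stream.seek(hits_same_encoding[1])
chunk = stream.read(len("foó".encode("utf-8")))
```

Note that the offsets are only meaningful for the file they were computed from; if the file is re-saved in another encoding the index has to be rebuilt.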

If the file you are searching is some binary file containing text then things are decidedly more tricky: to do the search accurately you need to know the exact format of the binary file, so you know precisely where the (encoded) text strings within it sit. This is presuming you are not happy with 'false positives'.

(A stackfile, for example, contains encoded text and sequences of bytes
which never were and never will be text - however, it is perfectly possible
for the latter to match encoded text, just by chance.)
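
A small, contrived illustration of such a chance match, sketched in Python rather than LiveCode. The integer value is picked by hand so that the collision happens; it is not taken from any real file format.

```python
import struct

# A 32-bit little-endian integer, as might sit in some binary record.
record_field = struct.pack("<I", 0x00620061)

# The text "ab" encoded as UTF-16 (little-endian, no byte order mark).
encoded_text = "ab".encode("utf-16-le")

# Both are the byte sequence 61 00 62 00, so a raw byte search cannot
# tell the number apart from the text.
is_false_positive = encoded_text in record_field
```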

If you are wanting a caseless match rather than an exact match then you pretty much have to treat the file as text - you can't do caseless matching on arbitrary
bytes as it makes no sense (as they are just bytes with no meaning).
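
One possible way to get byte offsets out of a caseless search is sketched below in Python (not LiveCode). It assumes the whole file decodes cleanly in a known, BOM-less encoding name such as "utf-8" or "utf-16-le"; the helper `caseless_byte_offsets` is hypothetical. The data is decoded to text, matched caselessly as text, and each match position is converted back to a byte offset by re-encoding the text that precedes it.

```python
import re

def caseless_byte_offsets(data: bytes, needle: str, encoding: str) -> list:
    """Caseless search done on decoded text; results reported as byte offsets."""
    text = data.decode(encoding)
    offsets = []
    for match in re.finditer(re.escape(needle), text, flags=re.IGNORECASE):
        # The byte offset is the encoded length of everything before the match.
        prefix = text[:match.start()]
        offsets.append(len(prefix.encode(encoding)))
    return offsets

# Hypothetical UTF-8 file contents.
sample = "Été FOÓ foó".encode("utf-8")
found = caseless_byte_offsets(sample, "foó", "utf-8")
```

Re-encoding the prefix for every match is wasteful on large files; it is written this way only to keep the text-to-byte mapping obvious.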

Earlier you wrote:

  the number of bytes in textEncode(tText, kEncoding)

...which implies that I would need to know the encoding (kEncoding),
but do I really need textEncode for the use-case described here?

Strictly speaking that depends on the encoding:

For native encoding - number of bytes = number of codeunits

For UTF-16 - number of bytes = 2 * number of codeunits

For UTF-32 - number of bytes = 4 * number of codepoints

However, UTF-8 is a multibyte encoding based on the codepoints in the
text. A single codepoint can be encoded as 1, 2, 3 or 4 bytes.

The point here being, in order to compute the byte length of a piece of
text encoded as UTF-8 you need to look at each character. Since textEncode
does that, it is a reasonably clear way of working such things out.
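
As a rough check of the arithmetic, here is a sketch in Python (not LiveCode). The helper names are made up for illustration. It works out the UTF-8 length by looking at each codepoint, and collects the lengths you get by simply encoding the text, which is the role textEncode plays in LiveCode.

```python
def utf8_bytes_for_codepoint(cp: int) -> int:
    """Number of bytes UTF-8 uses for a single codepoint."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def utf8_length(text: str) -> int:
    """UTF-8 byte length, computed by examining every codepoint in turn."""
    return sum(utf8_bytes_for_codepoint(ord(ch)) for ch in text)

samples = ["hello", "foó", "€", "😀"]

# (UTF-8, UTF-16, UTF-32) byte lengths obtained by actually encoding.
# The "-le" variants are used so no byte order mark is added.
lengths = {
    s: (
        len(s.encode("utf-8")),
        len(s.encode("utf-16-le")),
        len(s.encode("utf-32-le")),
    )
    for s in samples
}

# The per-codepoint calculation agrees with the encoded length.
utf8_agrees = all(utf8_length(s) == lengths[s][0] for s in samples)
```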

By the way, here I've mentioned three things - codeunit, codepoint and
char:

  - a codeunit is the smallest element in UTF-16 and represents Unicode
    codepoints 0-65535 (i.e. fits in a 16-bit unsigned int).

  - a codepoint is the natural 'unit' of Unicode - a 21-bit quantity which
    indexes into the Unicode char tables. (UTF-16 encodes the 21-bit
    quantity by using 'surrogate' pairs of codeunits - meaning that, in
    that encoding, a codepoint can take 1 or 2 codeunits).

  - a char is a sequence of codepoints which are generally considered to
    be a single (human-processable) character.

I'm not sure if the above helps or not - it might be helpful to explain the
problem you are trying to solve in more depth. I still can't quite see how
the byte length of a piece of text (encoded in a particular encoding) is
useful. Surely you need the byte sequence to search for anyway, in which
case the number of bytes is just the length of the byte sequence you
already have...

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
