[julia-users] Re: String(Vector{UInt8}) Question

2016-10-05 Thread Steven G. Johnson


On Wednesday, October 5, 2016 at 12:01:32 AM UTC-4, josh...@fastmail.com 
wrote:
>
> OK, I understand now: they're continuation bytes for UTF-8 and can't 
> appear in that context so they get stripped from the string representation.
>

They don't get stripped: the invalid data is still stored in the String.
However, anything that iterates over Unicode characters skips them. That
includes length, which counts Unicode codepoints rather than bytes.

julia> s = String([0x82,0x82,0x82,0x82,0x82])
5-byte String of invalid UTF-8 data:
 0x82
 0x82
 0x82
 0x82
 0x82

julia> length(s)
0

julia> sizeof(s)
5
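
To confirm that the bytes are still stored, one can inspect the code units directly instead of iterating over characters. This is a minimal sketch that assumes a Julia 1.x session, where `codeunits` is available (it is not the 0.5 API used in this thread). Note that later Julia versions also changed `length` to count each invalid byte as one character, so the `0` shown above is specific to 0.5.

```julia
bytes = [0x82, 0x82, 0x82, 0x82, 0x82]

# Copy first: String(::Vector{UInt8}) may take ownership of its argument.
s = String(copy(bytes))

nbytes = sizeof(s)             # bytes stored, independent of validity: 5
raw = collect(codeunits(s))    # the raw bytes, unchanged by the conversion
valid = isvalid(s)             # false: five bare continuation bytes are not valid UTF-8
```

Here `sizeof` and `codeunits` work at the byte level, so they report the data as stored whether or not it decodes as UTF-8.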



[julia-users] Re: String(Vector{UInt8}) Question

2016-10-04 Thread joshbode
OK, I understand now: they're continuation bytes for UTF-8 and can't appear 
in that context so they get stripped from the string representation.
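
For reference, a UTF-8 continuation byte is any byte with the bit pattern 10xxxxxx, which is exactly the range 128-191 seen in the output quoted below. A small sketch of that check follows; `is_continuation` is a hypothetical helper written for illustration, not a function from Base.

```julia
# A UTF-8 continuation byte has its top two bits set to 10 (pattern 10xxxxxx).
# is_continuation is a hypothetical helper, not part of Base.
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Collect every byte value that matches the pattern.
cont = [b for b in 0x00:0xff if is_continuation(b)]

summary_tuple = (Int(first(cont)), Int(last(cont)), length(cont))   # (128, 191, 64)
```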

On Wednesday, October 5, 2016 at 2:38:54 PM UTC+11, josh...@fastmail.com 
wrote:
>
> Hello julia-users,
>
> I'm scratching my head as to the following behaviour with using String on 
> a Vector of UInt8 values:
>
> for x in 0x00:0xff
>     print("$(lpad(x, 5)): ")
>     print(length(String([x])))
>     if (x + 1) % 8 == 0 println() end
> end
> which on OS X and Linux using Julia 0.5 yields:
>     0: 1    1: 1    2: 1    3: 1    4: 1    5: 1    6: 1    7: 1
>     8: 1    9: 1   10: 1   11: 1   12: 1   13: 1   14: 1   15: 1
>    16: 1   17: 1   18: 1   19: 1   20: 1   21: 1   22: 1   23: 1
>    24: 1   25: 1   26: 1   27: 1   28: 1   29: 1   30: 1   31: 1
>    32: 1   33: 1   34: 1   35: 1   36: 1   37: 1   38: 1   39: 1
>    40: 1   41: 1   42: 1   43: 1   44: 1   45: 1   46: 1   47: 1
>    48: 1   49: 1   50: 1   51: 1   52: 1   53: 1   54: 1   55: 1
>    56: 1   57: 1   58: 1   59: 1   60: 1   61: 1   62: 1   63: 1
>    64: 1   65: 1   66: 1   67: 1   68: 1   69: 1   70: 1   71: 1
>    72: 1   73: 1   74: 1   75: 1   76: 1   77: 1   78: 1   79: 1
>    80: 1   81: 1   82: 1   83: 1   84: 1   85: 1   86: 1   87: 1
>    88: 1   89: 1   90: 1   91: 1   92: 1   93: 1   94: 1   95: 1
>    96: 1   97: 1   98: 1   99: 1  100: 1  101: 1  102: 1  103: 1
>   104: 1  105: 1  106: 1  107: 1  108: 1  109: 1  110: 1  111: 1
>   112: 1  113: 1  114: 1  115: 1  116: 1  117: 1  118: 1  119: 1
>   120: 1  121: 1  122: 1  123: 1  124: 1  125: 1  126: 1  127: 1
>   128: 0  129: 0  130: 0  131: 0  132: 0  133: 0  134: 0  135: 0
>   136: 0  137: 0  138: 0  139: 0  140: 0  141: 0  142: 0  143: 0
>   144: 0  145: 0  146: 0  147: 0  148: 0  149: 0  150: 0  151: 0
>   152: 0  153: 0  154: 0  155: 0  156: 0  157: 0  158: 0  159: 0
>   160: 0  161: 0  162: 0  163: 0  164: 0  165: 0  166: 0  167: 0
>   168: 0  169: 0  170: 0  171: 0  172: 0  173: 0  174: 0  175: 0
>   176: 0  177: 0  178: 0  179: 0  180: 0  181: 0  182: 0  183: 0
>   184: 0  185: 0  186: 0  187: 0  188: 0  189: 0  190: 0  191: 0
>   192: 1  193: 1  194: 1  195: 1  196: 1  197: 1  198: 1  199: 1
>   200: 1  201: 1  202: 1  203: 1  204: 1  205: 1  206: 1  207: 1
>   208: 1  209: 1  210: 1  211: 1  212: 1  213: 1  214: 1  215: 1
>   216: 1  217: 1  218: 1  219: 1  220: 1  221: 1  222: 1  223: 1
>   224: 1  225: 1  226: 1  227: 1  228: 1  229: 1  230: 1  231: 1
>   232: 1  233: 1  234: 1  235: 1  236: 1  237: 1  238: 1  239: 1
>   240: 1  241: 1  242: 1  243: 1  244: 1  245: 1  246: 1  247: 1
>   248: 1  249: 1  250: 1  251: 1  252: 1  253: 1  254: 1  255: 1
>
> Why would the range 128-191 yield a string of length 0?
> Note: this occurs regardless of the length of the vector being converted 
> (i.e. any element in this range is omitted in the converted string)
>
> Cheers,
> Josh
>