Hi List, Due to a bug in our own code we inserted a string with embedded nul character in the database. This caused strange behavior when we used the built-in function instr() on this data. As part of the analysis of this issue I investigated how the builtin functions handle strings with embedded nul characters. I want to share my findings.
Although I don't find in the sqlite documentation if a nul character is special, and ends a string, the specification and implementation of some string related functions seem to suggest this. However, the behaviour of string functions is not always consistent. For the tests below I assumed that the intention is that a string ends at the first embedded nul character. Summary: - The instr() function returns incorrect results. - The rtrim() and replace() functions return incorrect results, but this is usually not noticeable. - builtin functions not consistent in truncation after embedded nul character Please document that functions that operate on string arguments process the data till the first embedded nul character, if any, and (may) truncate excess data (or otherwise that string data may/should not contain embedded nul character). Please adapt the implementation of instr() with string arguments to stop searching after an embedded nul character. No strong opinion about desired behaviour of rtrim() and replace(). Details: Below my test queries (sqlite version 3.11.1), executed on TEXT data and BLOB data. Most queries produce also HEX-output to show the exact output. Annotated output added as comments. .mode lines CREATE TABLE test(t TEXT, b BLOB); INSERT INTO test VALUES ( 'A' || x'00' || 'B ', CAST('A' || x'00' || 'B ' AS BLOB)); -- show actual data value and type of data SELECT t, hex(t), typeof(t), b, hex(b), typeof(b) FROM test; -- t = A -- hex(t) = 41004220 -- OK. contains embedded nul character -- typeof(t) = text -- b = A -- hex(b) = 41004220 -- OK. contains embedded nul character -- typeof(b) = blob SELECT length(t), length(b) FROM test; -- length(t) = 1 -- OK. till first embedded nul character -- length(b) = 4 -- OK. size of entire blob SELECT lower(t), hex(lower(t)), lower(b), hex(lower(b)) FROM test; -- lower(t) = a -- hex(lower(t)) = 61 -- OK. String is assumed to end at embedded nul character. -- lower(b) = a -- hex(lower(b)) = 61 -- OK. Blob is interpreted as string (suggested in spec), so processed till embedded nul character. SELECT rtrim(t), hex(rtrim(t)), rtrim(b), hex(rtrim(b)) FROM test; -- rtrim(t) = A -- hex(rtrim(t)) = 410042 -- NOT OK. Data after embedded nul character is considered string content. Changed part is usually -- -- not noticed if resulting data is processed as string (so upto embedded nul char) -- -- Note that output is not truncted at nul character like lower() does. -- rtrim(b) = A -- hex(rtrim(b)) = 410042 -- Blob is interpreted as string. Same behaviour as string. SELECT quote(t), hex(quote(t)), quote(b), hex(quote(b)) FROM test; -- quote(t) = 'A' -- hex(quote(t)) = 274127 -- OK. As specified. String till first embedded nul character quoted -- quote(b) = X'41004220' -- OK. As specified. Hex representation of entire string -- hex(quote(b)) = 5827343130303432323027 SELECT replace(t, 'B', 'C'), hex(replace(t, 'B', 'C')), replace(b, 'B', 'C'), hex(replace(b, 'B', 'C')) FROM test; -- replace(t, 'B', 'C') = A -- hex(replace(t, 'B', 'C')) = 41004320 -- NOT OK. Replaces also characters after embedded nul. Chnaged part is usually not noticed if -- -- resulting data is processed as string (so upto embedded nul char) -- -- Note that output is not truncated at embedded nul character like lower() does. -- replace(b, 'B', 'C') = A -- hex(replace(b, 'B', 'C')) = 41004320 -- OK. Replaces in entire data SELECT substr(t, 1, 10), hex(substr(t, 1, 10)), substr(b, 1, 10), hex(substr(b, 1, 10)), substr(t, 3, 10), hex(substr(t, 3, 10)), substr(b, 3, 10), hex(substr(b, 3, 10)) FROM test; -- substr(t, 1, 10) = A -- hex(substr(t, 1, 10)) = 41 -- OK. Till embedded nul character -- substr(b, 1, 10) = A -- hex(substr(b, 1, 10)) = 41004220 -- OK. Entire data -- substr(t, 3, 10) = -- hex(substr(t, 3, 10)) = -- OK. Pos 3 is after emdedded nul character, so not part of string -- substr(b, 3, 10) = B -- hex(substr(b, 3, 10)) = 4220 -- Ok substitute in all data SELECT instr(t, 'B'), instr(b, 'B') FROM test; -- instr(t, 'B') = 3 -- NOT OK. String ends at embedded nul character. This violates the specification, which -- -- explicitly says that that arguments are treated as strings (so up to emdedded nul). -- -- Should return 0 (=not found). -- instr(b, 'B') = 3 -- OK. search in all data Met Vriendelijke Groet, Kind Regards, 谨致问候, Rob _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users