Hi List,

Due to a bug in our own code we inserted a string with embedded nul character 
in the database. This caused strange behavior when we used the built-in 
function instr() on this data. As part of the analysis of this issue I 
investigated how the builtin functions handle strings with embedded nul 
characters. I want to share my findings.

Although I don't find in the sqlite documentation if a nul character is 
special, and ends a string, the specification and implementation of some string 
related functions seem to suggest this. However, the behaviour of string 
functions is not always consistent.
For the tests below I assumed that the intention is that a string ends at the 
first embedded nul character.
Summary:
- The instr() function returns incorrect results.
- The rtrim() and replace() functions return incorrect results, but this is 
usually not noticeable.
- builtin functions not consistent in truncation after embedded nul character

Please document that functions that operate on string arguments process the 
data till the first embedded nul character, if any, and (may) truncate excess 
data (or otherwise that string data may/should not contain embedded nul 
character).
Please adapt the implementation of instr() with string arguments to stop 
searching after an embedded nul character. No strong opinion about desired 
behaviour of rtrim() and replace().

Details:
Below my test queries (sqlite version 3.11.1), executed on TEXT data and BLOB 
data. Most queries produce also HEX-output to show the exact output. Annotated 
output added as comments.

.mode lines
CREATE TABLE test(t TEXT, b BLOB);
INSERT INTO test VALUES (      'A' || x'00' || 'B ',
                          CAST('A' || x'00' || 'B ' AS BLOB));

-- show actual data value and type of data
SELECT t, hex(t), typeof(t), 
       b, hex(b), typeof(b)
  FROM test;
--         t = A
--    hex(t) = 41004220    -- OK. contains embedded nul character
-- typeof(t) = text
--         b = A
--    hex(b) = 41004220    -- OK. contains embedded nul character
-- typeof(b) = blob

SELECT length(t), 
       length(b)
  FROM test;
-- length(t) = 1     -- OK. till first embedded nul character
-- length(b) = 4     -- OK. size of entire blob

SELECT lower(t),
       hex(lower(t)),
       lower(b),
       hex(lower(b))
  FROM test;
--      lower(t) = a  
-- hex(lower(t)) = 61 -- OK. String is assumed to end at embedded nul 
character. 
--      lower(b) = a
-- hex(lower(b)) = 61 -- OK. Blob is interpreted as string (suggested in spec), 
so processed till embedded nul character.

SELECT rtrim(t),
       hex(rtrim(t)),
       rtrim(b),
       hex(rtrim(b))
  FROM test;
--      rtrim(t) = A
-- hex(rtrim(t)) = 410042  -- NOT OK. Data after embedded nul character is 
considered string content. Changed part is usually
--                         --         not noticed if resulting data is 
processed as string (so upto embedded nul char)
--                         --         Note that output is not truncted at nul 
character like lower() does.
--      rtrim(b) = A
-- hex(rtrim(b)) = 410042  -- Blob is interpreted as string. Same behaviour as 
string.

SELECT quote(t),
       hex(quote(t)),
       quote(b),
       hex(quote(b))
  FROM test;

--      quote(t) = 'A'        
-- hex(quote(t)) = 274127                  -- OK. As specified. String till 
first embedded nul character quoted
--      quote(b) = X'41004220'             -- OK. As specified. Hex 
representation of entire string
-- hex(quote(b)) = 5827343130303432323027  

SELECT replace(t, 'B', 'C'),
       hex(replace(t, 'B', 'C')),
       replace(b, 'B', 'C'),
       hex(replace(b, 'B', 'C'))
  FROM test;

--      replace(t, 'B', 'C') = A
-- hex(replace(t, 'B', 'C')) = 41004320  -- NOT OK. Replaces also characters 
after embedded nul. Chnaged part is usually not noticed if
--                                       --         resulting data is processed 
as string (so upto embedded nul char)
--                                       --         Note that output is not 
truncated at embedded nul character like lower() does.
--      replace(b, 'B', 'C') = A
-- hex(replace(b, 'B', 'C')) = 41004320  -- OK. Replaces in entire data

SELECT substr(t, 1, 10),
       hex(substr(t, 1, 10)),
       substr(b, 1, 10),
       hex(substr(b, 1, 10)),
       substr(t, 3, 10),
       hex(substr(t, 3, 10)),
       substr(b, 3, 10),
       hex(substr(b, 3, 10))
  FROM test;
--      substr(t, 1, 10) = A
-- hex(substr(t, 1, 10)) = 41        -- OK. Till embedded nul character
--      substr(b, 1, 10) = A
-- hex(substr(b, 1, 10)) = 41004220  -- OK. Entire data
--      substr(t, 3, 10) = 
-- hex(substr(t, 3, 10)) =           -- OK. Pos 3 is after emdedded nul 
character, so not part of string
--      substr(b, 3, 10) = B 
-- hex(substr(b, 3, 10)) = 4220      -- Ok substitute in all data

SELECT instr(t, 'B'),
       instr(b, 'B')
  FROM test;
-- instr(t, 'B') = 3                 -- NOT OK. String ends at embedded nul 
character. This violates the specification, which 
--                                   --         explicitly says that that 
arguments are treated as strings (so up to emdedded nul).
--                                   --         Should return 0 (=not found).
-- instr(b, 'B') = 3                 -- OK. search in all data

Met Vriendelijke Groet, Kind Regards, 谨致问候,

Rob

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to