Re: [sqlite] [EXTERNAL] Re: Patch: VTable Column Affinity Question and Change Request (WIthdrawn)

2020-02-07 Thread Simon Slavin
On 8 Feb 2020, at 4:05am, Keith Medcalf  wrote:

> filename=FILENAME is the csv filename in quotes.

Please mention in documentation whether full or partial paths can be included.

> header=BOOL   will "parse" the first row to get the column names unless a 
> SCHEMA is specified.

Looks good until I see

> skip=N 

Perhaps it would be better to use "header=N" to nominate a header row.  The 
first row is row 1.  "header=0" means "there is no header row".

Having made this change, "skip=N" says how many lines after the header line 
should be skipped.
sqlite-users mailing list

Re: [sqlite] [EXTERNAL] Re: Patch: VTable Column Affinity Question and Change Request (WIthdrawn)

2020-02-07 Thread Keith Medcalf

I don't think that patch to apply affinities is a good idea since it will 
usually be mostly useless and will negatively impact performance since one can 
and should assume that the actual author of the VTable knows what they are 
doing, currently SQLite3 does not enforce declared VTable column affinities and 
it should stay that way.  That said, although the definitions of the column 
affinities are stored in the internal schema, it is not trivial to access them 
from a non-builtin extension.

Instead I have once again changed the vsv.c VTable plugin so that it does some 
more useful things.

The below parameters work exactly as they did in the csv.c extension with 
regards to setting the functionality with the single exception that now if you 
specify all of header, schema and columns, the header row will still be 
skipped.  In csv.c if you specified both schema and columns then the header row 
was not skipped even if you specified it existed.

filename=FILENAME is the csv filename in quotes.
data=DATA provides the data inline.
header=BOOL   will "parse" the first row to get the column names unless a 
SCHEMA is specified.
schema=STRING allows the specification of the schema to use
columns=N number of columns in VTable

New parameters:


will skip the specified number of rows at the start of the CSV file, exclusive 
of the header row (ie, skip data rows). If the header, via the header option, 
is specified to exist then it will always be skipped. 

fsep=SEP and rsep=SEP

allow to specify the particular separators to be used.  The defaults are 
fsep=',' and rsep='\n' so the default is the same as the original.  The 
separator can be any single character, or it can be a supported escape code (\t 
\v \n \f for horizontal-tab, vertical-tab, newline and form-feed), or it can be 
a hexadecimal coded byte as in '\x1f'.  Multibyte separators cannot be 


where the affinity can be NONE, BLOB, TEXT, INTEGER, REAL, NUMERIC with the 
default being NONE.  NONE does nothing different from the existing csv.c VTable 
(unless you also specify validatetext).  However, if you specify an affinity it 
applies to *every* returned value and makes the following changes:
 - BLOB returns all data as BLOB
 - TEXT returns all data as TEXT
 - INTEGER returns anything that looks like an integer as an INTEGER otherwise 
 - REAL returns anything that looks like a number as a REAL (double) otherwise 
 - NUMERIC returns anything that looks like a number as either an INTEGER or 
REAL otherwise as TEXT
   - if the field "looks like an integer" then an INTEGER is returned
   - if the field "looks like a number" then a REAL is returned unless it can 
be converted to an integer
"looks like an integer" effectively means passing the following regex "^ 
*[+-]?\d+ *$"
"looks like a number" effectively means passing the following regex "^ 
*[+-]?(\d+\.?\d*|\d*\.?\d+)([eE][+-]?\d+)? *$"

Processing is limited by the runtime C library so the detection is not quite so 
good as the affinities built into the SQLite3 core.  This means that something 
that looks like an integer may fail to load as an integer because it is too big 
(918273745985736365575984857636253857564363 looks like an integer and the 
SQLite3 core will recognize this and load it as a double.  The C library is not 
so smart and will probably simply return MAX_INT).

Processing is also limited by the compiler.  Some compilers (wave at Microsoft) 
do not treat long double as being more precise than a regular double (ie, 10 or 
16 bytes rather than 8) and treat that simply as a plain old double (some 
compilers, like Intel, need a compilation option to enable long doubles being 
actually longer than a regular double).  It is detected if this is the case 
(sizeof(long dounble)==sizeof(double)) so that only "integers" that can be 
entirely contained in the mantissa are converted to integer, and others will 
remain as double.


will cause TEXT fields to be validated as containing a valid UTF-8 coding 
sequence (no content check is performed, only the validity of the encoding is 
checked).  If this is turned on together with any type of affinity (ie, other 
than none) then improperly encoded UTF8 text will be returned as a BLOB.  If no 
affinity is in effect (ie, affinity=none) then an error will be thrown rather 
then letting a text field contain invalid data.  This means that fields 
containing embedded nulls (rather than encoded 0 bytes) will not be able to be 
stored as TEXT and will either be BLOB type or throw an invalid encoding error.


when enabled empty fields will be returned as NULL rather than empty strings.  
An empty field means a field with no content (separators are side-by-each).  
Specifically empty strings are not changed.  That is, the middle column in this 
a,,b is a null column but this is not a,"",b

The fact that there's a Highway to Hell but only a