Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Gary R. Schmidt

On 27/01/2018 05:32, Peter Da Silva wrote:

On 1/26/18, 12:31 PM, "sqlite-users on behalf of J Decker" 
 wrote:

ctrl-z was end of file text character in DOS (wrote char 26; not FF)


DOS wasn't an operating system.
  
That will come as a surprise to the people who used DOS/360 and DOS/VSE 
and their various siblings.


Cheers,
GaryB-)
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread petern
trivial-cross-
>> > join-Defect-tt10.html
>> >
>> > http://sqlite.1065341.n5.nabble.com/Defect-trivial-cross-
>> join-strips-BLOB-
>> > subtype-tt99982.html
>> >
>> > Still waiting for an answer about why BLOB types are lost in a trivial
>> > cross join.  No answer.
>> >
>>
>> That's interesting; I have seen, in the shell, that if the column is blob,
>> it is just not shown.
>>
>>
>> >
>> > Is there a legitimate reason why BLOB type information cannot be passed
>> > through a join?
>> >
>> >
>>
>>
>>
>> > Peter
>> >
>> > On Fri, Jan 26, 2018 at 4:36 PM, Keith Medcalf <kmedc...@dessus.com>
>> > wrote:
>> >
>> > >
>> > > I do not understand this at all.
>> > >
>> > > If the definition of a C-String is a "bunch-a-non-zero-byes-
>> > terminated-by-a-zero-byte",
>> > > then how is it possible to have a zero/null byte "embedded" within a
>> > > C-Style String?
>> > >
>> > > Similarly, if a C-Style-Wide-String is defined as a
>> > > "bunch-a-non-zero-words-terminated-by-a-zero-word", then how is it
>> > > possible to have a zero/null word "embedded" within a
>> > C-Style-Wide-String?
>> > >
>> > > Given that SQLite3 is written in C and uses C-Strings or
>> > > C-Style-Wide-Strings, then you cannot have zero/null bytes embedded in
>> > > those strings.
>> > >
>> > > You may of course argue that perhaps SQLite3 should use something
>> other
>> > > than C-Style-Strings, however, this is not what seems to be
>> proposed.  It
>> > > seems to be proposing the use of some magical C-Style-String that is
>> not
>> > > actually a C-Style-String, without explicitly stating this.
>> > >
>> > > SQLite3 does handle non-C-Ctyle-Strings.  They are called "blobs".
>> > >
>> > > ---
>> > > The fact that there's a Highway to Hell but only a Stairway to Heaven
>> > says
>> > > a lot about anticipated traffic volume.
>> > >
>> > >
>> > > >-Original Message-
>> > > >From: sqlite-users [mailto:sqlite-users-
>> > > >boun...@mailinglists.sqlite.org] On Behalf Of J Decker
>> > > >Sent: Friday, 26 January, 2018 17:18
>> > > >To: SQLite mailing list
>> > > >Subject: Re: [sqlite] UTF8 and NUL
>> > > >
>> > > >On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
>> > > >peter.dasi...@flightaware.com> wrote:
>> > > >
>> > > >> On 2018-01-26, at 17:05, J Decker <d3c...@gmail.com> wrote:
>> > > >> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
>> > > >> > peter.dasi...@flightaware.com> wrote:
>> > > >> >> Sqlite uses NUL as the string terminator internally, the
>> > > >published API
>> > > >> >> specifies has stuff like this all over the place:
>> > > >>
>> > > >> >>> In those routines that have a fourth argument, its value is the
>> > > >number
>> > > >> of bytes in the parameter. To be clear: the value is the number of
>> > > >bytes in
>> > > >> the value, not the number of characters. If the fourth parameter to
>> > > >> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the
>> > > >length
>> > > >> of the string is the number of bytes UP TO THE FIRST ZERO
>> > > >TERMINATOR.
>> > > >>
>> > > >> > You stressed the wrong part there - *IS NEGATIVE*
>> > > >>
>> > > >> Why? Passing -1 as the length is a common way to tell sqlite3 to
>> > > >calculate
>> > > >> the length itself. It's a documented and widely used part of the
>> > > >API.
>> > > >
>> > > >
>> > > >Exactly, so on neither side, input or output is there a problem
>> > > >storing a
>> > > >length of valid characters.
>> > > >The deficiency is 1) the command line tool for diagnostics
>> > > >2) always scanning for a nul in prepare() unless the length is before
>> > > >that.  It's simple to add an option that could change that behavior;
>

Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
 about why BLOB types are lost in a trivial
> > > cross join.  No answer.
> > >
> >
> > That's interesting; I have seen, in the shell, that if the column is
> blob,
> > it is just not shown.
> >
> >
> > >
> > > Is there a legitimate reason why BLOB type information cannot be passed
> > > through a join?
> > >
> > >
> >
> >
> >
> > > Peter
> > >
> > > On Fri, Jan 26, 2018 at 4:36 PM, Keith Medcalf <kmedc...@dessus.com>
> > > wrote:
> > >
> > > >
> > > > I do not understand this at all.
> > > >
> > > > If the definition of a C-String is a "bunch-a-non-zero-byes-
> > > terminated-by-a-zero-byte",
> > > > then how is it possible to have a zero/null byte "embedded" within a
> > > > C-Style String?
> > > >
> > > > Similarly, if a C-Style-Wide-String is defined as a
> > > > "bunch-a-non-zero-words-terminated-by-a-zero-word", then how is it
> > > > possible to have a zero/null word "embedded" within a
> > > C-Style-Wide-String?
> > > >
> > > > Given that SQLite3 is written in C and uses C-Strings or
> > > > C-Style-Wide-Strings, then you cannot have zero/null bytes embedded
> in
> > > > those strings.
> > > >
> > > > You may of course argue that perhaps SQLite3 should use something
> other
> > > > than C-Style-Strings, however, this is not what seems to be proposed.
> > It
> > > > seems to be proposing the use of some magical C-Style-String that is
> > not
> > > > actually a C-Style-String, without explicitly stating this.
> > > >
> > > > SQLite3 does handle non-C-Ctyle-Strings.  They are called "blobs".
> > > >
> > > > ---
> > > > The fact that there's a Highway to Hell but only a Stairway to Heaven
> > > says
> > > > a lot about anticipated traffic volume.
> > > >
> > > >
> > > > >-Original Message-
> > > > >From: sqlite-users [mailto:sqlite-users-
> > > > >boun...@mailinglists.sqlite.org] On Behalf Of J Decker
> > > > >Sent: Friday, 26 January, 2018 17:18
> > > > >To: SQLite mailing list
> > > > >Subject: Re: [sqlite] UTF8 and NUL
> > > > >
> > > > >On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
> > > > >peter.dasi...@flightaware.com> wrote:
> > > > >
> > > > >> On 2018-01-26, at 17:05, J Decker <d3c...@gmail.com> wrote:
> > > > >> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
> > > > >> > peter.dasi...@flightaware.com> wrote:
> > > > >> >> Sqlite uses NUL as the string terminator internally, the
> > > > >published API
> > > > >> >> specifies has stuff like this all over the place:
> > > > >>
> > > > >> >>> In those routines that have a fourth argument, its value is
> the
> > > > >number
> > > > >> of bytes in the parameter. To be clear: the value is the number of
> > > > >bytes in
> > > > >> the value, not the number of characters. If the fourth parameter
> to
> > > > >> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the
> > > > >length
> > > > >> of the string is the number of bytes UP TO THE FIRST ZERO
> > > > >TERMINATOR.
> > > > >>
> > > > >> > You stressed the wrong part there - *IS NEGATIVE*
> > > > >>
> > > > >> Why? Passing -1 as the length is a common way to tell sqlite3 to
> > > > >calculate
> > > > >> the length itself. It's a documented and widely used part of the
> > > > >API.
> > > > >
> > > > >
> > > > >Exactly, so on neither side, input or output is there a problem
> > > > >storing a
> > > > >length of valid characters.
> > > > >The deficiency is 1) the command line tool for diagnostics
> > > > >2) always scanning for a nul in prepare() unless the length is
> before
> > > > >that.  It's simple to add an option that could change that behavior;
> > > > >or
> > > > >move the string measuring up to prepare[_v2,_v3,_

Re: [sqlite] UTF8 and NUL

2018-01-26 Thread petern
t;
> > > Given that SQLite3 is written in C and uses C-Strings or
> > > C-Style-Wide-Strings, then you cannot have zero/null bytes embedded in
> > > those strings.
> > >
> > > You may of course argue that perhaps SQLite3 should use something other
> > > than C-Style-Strings, however, this is not what seems to be proposed.
> It
> > > seems to be proposing the use of some magical C-Style-String that is
> not
> > > actually a C-Style-String, without explicitly stating this.
> > >
> > > SQLite3 does handle non-C-Ctyle-Strings.  They are called "blobs".
> > >
> > > ---
> > > The fact that there's a Highway to Hell but only a Stairway to Heaven
> > says
> > > a lot about anticipated traffic volume.
> > >
> > >
> > > >-Original Message-
> > > >From: sqlite-users [mailto:sqlite-users-
> > > >boun...@mailinglists.sqlite.org] On Behalf Of J Decker
> > > >Sent: Friday, 26 January, 2018 17:18
> > > >To: SQLite mailing list
> > > >Subject: Re: [sqlite] UTF8 and NUL
> > > >
> > > >On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
> > > >peter.dasi...@flightaware.com> wrote:
> > > >
> > > >> On 2018-01-26, at 17:05, J Decker <d3c...@gmail.com> wrote:
> > > >> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
> > > >> > peter.dasi...@flightaware.com> wrote:
> > > >> >> Sqlite uses NUL as the string terminator internally, the
> > > >published API
> > > >> >> specifies has stuff like this all over the place:
> > > >>
> > > >> >>> In those routines that have a fourth argument, its value is the
> > > >number
> > > >> of bytes in the parameter. To be clear: the value is the number of
> > > >bytes in
> > > >> the value, not the number of characters. If the fourth parameter to
> > > >> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the
> > > >length
> > > >> of the string is the number of bytes UP TO THE FIRST ZERO
> > > >TERMINATOR.
> > > >>
> > > >> > You stressed the wrong part there - *IS NEGATIVE*
> > > >>
> > > >> Why? Passing -1 as the length is a common way to tell sqlite3 to
> > > >calculate
> > > >> the length itself. It's a documented and widely used part of the
> > > >API.
> > > >
> > > >
> > > >Exactly, so on neither side, input or output is there a problem
> > > >storing a
> > > >length of valid characters.
> > > >The deficiency is 1) the command line tool for diagnostics
> > > >2) always scanning for a nul in prepare() unless the length is before
> > > >that.  It's simple to add an option that could change that behavior;
> > > >or
> > > >move the string measuring up to prepare[_v2,_v3,_v4] and even add a
> > > >V5 that
> > > >just passes the length passed without a scan.
> > > >
> > > >The input is read by a tokenizer that returns in-buffer references to
> > > >the
> > > >next SQL token by length.
> > > >Some tokens can be quoted, and those end up being a copy of the
> > > >original;
> > > >but the length of the SQL statement should already be known, so it
> > > >doesn't
> > > >need to scan for 0.
> > > >
> > > >Once tokenized it's converted into expressions; those expressions
> > > >(have
> > > >previously) stored only the char*.  It's not a lot of places to
> > > >change to
> > > >include storing the length; which is often known unless the mprintf
> > > >internals are used; then any token passed through that does not pass
> > > >%s.
> > > >So %s cannot be used for UTF8 strings; but rather the literal string
> > > >fwrite( buf, 1, stringlen, <output) ; gets all the standard character
> > > >treatment as the file was opened with (O_BINARY or not, "b" or "t"
> > > >specifiers for fopen, or stderr ).
> > > >
> > > >fprintf( out, "%s", (string) );
> > > >is exactly the same as
> > > >fwrite( out, 1, strlen( string ), string );
> > > >
> > > >(Can anyone dispute that?  I doubt that's specified)
> > > >
> > > >Other than, the fwrite wil

Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
char inserts two chars for these... so it's hard to generate a sequence
that looks like '1' for length function... inserting a C string that was
'hi\xc2\x93\x93\x93\x93\x93\x93' length would be 3.
'hi\x93\x93\x93\x93\x93\x93' length would be 9.

but without bind... this is as much as I can show.

insert into test (a) values
('hi'||char(192)||char(150)||char(150)||char(150)||char(150)||'there' );
insert into test (a) values ('hi'||char(0)||'there' );
select length(a),a from test;
(output to windows terminal)
12|hiÀ––––there
2|hi

(in notepad++, )
12|hiÀthere
2|hi

(although that is what I would expect.  I wouldn't suggest changing any of
that, well ya, ...)

2|hi'||char(0)||'there

would look better - but in the select output context there aren't
quotes although that does work to preserve data for sqlite backup.

On Fri, Jan 26, 2018 at 5:22 PM, petern <peter.nichvolo...@gmail.com> wrote:

> That's an interesting idea, using BLOBs.  BLOB strings would be more
> practical if common SQL scalar operators { || , LIKE, =, <>,...} could be
> overloaded with user definable BLOB specific implementations.  At the same
> time subtype and pointer type would have to be improved to work in all
> cases.  This would be far more general solution than messing up the API
> with ubiquitous length argument.
>
> FYI, here are some current very serious deficiencies with BLOB type system
> including detailed test programs:
>

Blob also is binary, and I'm not dealing with binary, i'm dealing with UTF8
Text.  It is a totally different sort of thing than a BLOB would be.


>
> http://sqlite.1065341.n5.nabble.com/sqlite3-value-
> pointer-metadata-is-also-stripped-by-trivial-cross-
> join-Defect-tt10.html
>
> http://sqlite.1065341.n5.nabble.com/Defect-trivial-cross-join-strips-BLOB-
> subtype-tt99982.html
>
> Still waiting for an answer about why BLOB types are lost in a trivial
> cross join.  No answer.
>

That's interesting; I have seen, in the shell, that if the column is blob,
it is just not shown.


>
> Is there a legitimate reason why BLOB type information cannot be passed
> through a join?
>
>



> Peter
>
> On Fri, Jan 26, 2018 at 4:36 PM, Keith Medcalf <kmedc...@dessus.com>
> wrote:
>
> >
> > I do not understand this at all.
> >
> > If the definition of a C-String is a "bunch-a-non-zero-byes-
> terminated-by-a-zero-byte",
> > then how is it possible to have a zero/null byte "embedded" within a
> > C-Style String?
> >
> > Similarly, if a C-Style-Wide-String is defined as a
> > "bunch-a-non-zero-words-terminated-by-a-zero-word", then how is it
> > possible to have a zero/null word "embedded" within a
> C-Style-Wide-String?
> >
> > Given that SQLite3 is written in C and uses C-Strings or
> > C-Style-Wide-Strings, then you cannot have zero/null bytes embedded in
> > those strings.
> >
> > You may of course argue that perhaps SQLite3 should use something other
> > than C-Style-Strings, however, this is not what seems to be proposed.  It
> > seems to be proposing the use of some magical C-Style-String that is not
> > actually a C-Style-String, without explicitly stating this.
> >
> > SQLite3 does handle non-C-Ctyle-Strings.  They are called "blobs".
> >
> > ---
> > The fact that there's a Highway to Hell but only a Stairway to Heaven
> says
> > a lot about anticipated traffic volume.
> >
> >
> > >-Original Message-
> > >From: sqlite-users [mailto:sqlite-users-
> > >boun...@mailinglists.sqlite.org] On Behalf Of J Decker
> > >Sent: Friday, 26 January, 2018 17:18
> > >To: SQLite mailing list
> > >Subject: Re: [sqlite] UTF8 and NUL
> > >
> > >On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
> > >peter.dasi...@flightaware.com> wrote:
> > >
> > >> On 2018-01-26, at 17:05, J Decker <d3c...@gmail.com> wrote:
> > >> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
> > >> > peter.dasi...@flightaware.com> wrote:
> > >> >> Sqlite uses NUL as the string terminator internally, the
> > >published API
> > >> >> specifies has stuff like this all over the place:
> > >>
> > >> >>> In those routines that have a fourth argument, its value is the
> > >number
> > >> of bytes in the parameter. To be clear: the value is the number of
> > >bytes in
> > >> the value, not the number of characters. If the fourth parameter to
> > >> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the
> > >length
> >

Re: [sqlite] UTF8 and NUL

2018-01-26 Thread petern
That's an interesting idea, using BLOBs.  BLOB strings would be more
practical if common SQL scalar operators { || , LIKE, =, <>,...} could be
overloaded with user definable BLOB specific implementations.  At the same
time subtype and pointer type would have to be improved to work in all
cases.  This would be far more general solution than messing up the API
with ubiquitous length argument.

FYI, here are some current very serious deficiencies with BLOB type system
including detailed test programs:

http://sqlite.1065341.n5.nabble.com/sqlite3-value-pointer-metadata-is-also-stripped-by-trivial-cross-join-Defect-tt10.html

http://sqlite.1065341.n5.nabble.com/Defect-trivial-cross-join-strips-BLOB-subtype-tt99982.html

Still waiting for an answer about why BLOB types are lost in a trivial
cross join.  No answer.

Is there a legitimate reason why BLOB type information cannot be passed
through a join?

Peter

On Fri, Jan 26, 2018 at 4:36 PM, Keith Medcalf <kmedc...@dessus.com> wrote:

>
> I do not understand this at all.
>
> If the definition of a C-String is a 
> "bunch-a-non-zero-byes-terminated-by-a-zero-byte",
> then how is it possible to have a zero/null byte "embedded" within a
> C-Style String?
>
> Similarly, if a C-Style-Wide-String is defined as a
> "bunch-a-non-zero-words-terminated-by-a-zero-word", then how is it
> possible to have a zero/null word "embedded" within a C-Style-Wide-String?
>
> Given that SQLite3 is written in C and uses C-Strings or
> C-Style-Wide-Strings, then you cannot have zero/null bytes embedded in
> those strings.
>
> You may of course argue that perhaps SQLite3 should use something other
> than C-Style-Strings, however, this is not what seems to be proposed.  It
> seems to be proposing the use of some magical C-Style-String that is not
> actually a C-Style-String, without explicitly stating this.
>
> SQLite3 does handle non-C-Ctyle-Strings.  They are called "blobs".
>
> ---
> The fact that there's a Highway to Hell but only a Stairway to Heaven says
> a lot about anticipated traffic volume.
>
>
> >-Original Message-
> >From: sqlite-users [mailto:sqlite-users-
> >boun...@mailinglists.sqlite.org] On Behalf Of J Decker
> >Sent: Friday, 26 January, 2018 17:18
> >To: SQLite mailing list
> >Subject: Re: [sqlite] UTF8 and NUL
> >
> >On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
> >peter.dasi...@flightaware.com> wrote:
> >
> >> On 2018-01-26, at 17:05, J Decker <d3c...@gmail.com> wrote:
> >> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
> >> > peter.dasi...@flightaware.com> wrote:
> >> >> Sqlite uses NUL as the string terminator internally, the
> >published API
> >> >> specifies has stuff like this all over the place:
> >>
> >> >>> In those routines that have a fourth argument, its value is the
> >number
> >> of bytes in the parameter. To be clear: the value is the number of
> >bytes in
> >> the value, not the number of characters. If the fourth parameter to
> >> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the
> >length
> >> of the string is the number of bytes UP TO THE FIRST ZERO
> >TERMINATOR.
> >>
> >> > You stressed the wrong part there - *IS NEGATIVE*
> >>
> >> Why? Passing -1 as the length is a common way to tell sqlite3 to
> >calculate
> >> the length itself. It's a documented and widely used part of the
> >API.
> >
> >
> >Exactly, so on neither side, input or output is there a problem
> >storing a
> >length of valid characters.
> >The deficiency is 1) the command line tool for diagnostics
> >2) always scanning for a nul in prepare() unless the length is before
> >that.  It's simple to add an option that could change that behavior;
> >or
> >move the string measuring up to prepare[_v2,_v3,_v4] and even add a
> >V5 that
> >just passes the length passed without a scan.
> >
> >The input is read by a tokenizer that returns in-buffer references to
> >the
> >next SQL token by length.
> >Some tokens can be quoted, and those end up being a copy of the
> >original;
> >but the length of the SQL statement should already be known, so it
> >doesn't
> >need to scan for 0.
> >
> >Once tokenized it's converted into expressions; those expressions
> >(have
> >previously) stored only the char*.  It's not a lot of places to
> >change to
> >include storing the length; which is often known unless the mprintf
> >internals are used; then any token passed throug

Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 4:36 PM, Keith Medcalf <kmedc...@dessus.com> wrote:

>
> I do not understand this at all.
>
> If the definition of a C-String is a 
> "bunch-a-non-zero-byes-terminated-by-a-zero-byte",
> then how is it possible to have a zero/null byte "embedded" within a
> C-Style String?
>
this is about SQL not C.  C is easy \0.  Lots of \\, support. \x00

char thing[] = "Hello\0world";
#define strlen_thing(a)  ( sizeof(a)/(sizeof(a[0])-1 ) )



>
> Similarly, if a C-Style-Wide-String is defined as a
> "bunch-a-non-zero-words-terminated-by-a-zero-word", then how is it
> possible to have a zero/null word "embedded" within a C-Style-Wide-String?
>
> Given that SQLite3 is written in C and uses C-Strings or
> C-Style-Wide-Strings, then you cannot have zero/null bytes embedded in
> those strings.
>
> It certainly can.  It's isolated from the language by working on tokens
and expression pieces for which it has(or can have) the length.



> You may of course argue that perhaps SQLite3 should use something other
> than C-Style-Strings, however, this is not what seems to be proposed.  It
> seems to be proposing the use of some magical C-Style-String that is not
> actually a C-Style-String, without explicitly stating this.
>
>
They don't use c style strings though, the strings passed are SQL, which
only have a single escape for the type of quote it starts with.  If it used
'c style strings' it would behave like MySQL/TSQL but it's not; it's PSQL.

"Hello""World"
'Hello''world'


> SQLite3 does handle non-C-Ctyle-Strings.  They are called "blobs".
>
> It does, and has for a decade or more as char.



> ---
> The fact that there's a Highway to Hell but only a Stairway to Heaven says
> a lot about anticipated traffic volume.
>
>
> >-Original Message-----
> >From: sqlite-users [mailto:sqlite-users-
> >boun...@mailinglists.sqlite.org] On Behalf Of J Decker
> >Sent: Friday, 26 January, 2018 17:18
> >To: SQLite mailing list
> >Subject: Re: [sqlite] UTF8 and NUL
> >
> >On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
> >peter.dasi...@flightaware.com> wrote:
> >
> >> On 2018-01-26, at 17:05, J Decker <d3c...@gmail.com> wrote:
> >> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
> >> > peter.dasi...@flightaware.com> wrote:
> >> >> Sqlite uses NUL as the string terminator internally, the
> >published API
> >> >> specifies has stuff like this all over the place:
> >>
> >> >>> In those routines that have a fourth argument, its value is the
> >number
> >> of bytes in the parameter. To be clear: the value is the number of
> >bytes in
> >> the value, not the number of characters. If the fourth parameter to
> >> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the
> >length
> >> of the string is the number of bytes UP TO THE FIRST ZERO
> >TERMINATOR.
> >>
> >> > You stressed the wrong part there - *IS NEGATIVE*
> >>
> >> Why? Passing -1 as the length is a common way to tell sqlite3 to
> >calculate
> >> the length itself. It's a documented and widely used part of the
> >API.
> >
> >
> >Exactly, so on neither side, input or output is there a problem
> >storing a
> >length of valid characters.
> >The deficiency is 1) the command line tool for diagnostics
> >2) always scanning for a nul in prepare() unless the length is before
> >that.  It's simple to add an option that could change that behavior;
> >or
> >move the string measuring up to prepare[_v2,_v3,_v4] and even add a
> >V5 that
> >just passes the length passed without a scan.
> >
> >The input is read by a tokenizer that returns in-buffer references to
> >the
> >next SQL token by length.
> >Some tokens can be quoted, and those end up being a copy of the
> >original;
> >but the length of the SQL statement should already be known, so it
> >doesn't
> >need to scan for 0.
> >
> >Once tokenized it's converted into expressions; those expressions
> >(have
> >previously) stored only the char*.  It's not a lot of places to
> >change to
> >include storing the length; which is often known unless the mprintf
> >internals are used; then any token passed through that does not pass
> >%s.
> >So %s cannot be used for UTF8 strings; but rather the literal string
> >fwrite( buf, 1, stringlen, <output) ; gets all the standard character
> >treatment as the file was opened with (O_BINARY or not, "b" or 

Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Keith Medcalf

I do not understand this at all.  

If the definition of a C-String is a 
"bunch-a-non-zero-byes-terminated-by-a-zero-byte", then how is it possible to 
have a zero/null byte "embedded" within a C-Style String?

Similarly, if a C-Style-Wide-String is defined as a 
"bunch-a-non-zero-words-terminated-by-a-zero-word", then how is it possible to 
have a zero/null word "embedded" within a C-Style-Wide-String?

Given that SQLite3 is written in C and uses C-Strings or C-Style-Wide-Strings, 
then you cannot have zero/null bytes embedded in those strings.

You may of course argue that perhaps SQLite3 should use something other than 
C-Style-Strings, however, this is not what seems to be proposed.  It seems to 
be proposing the use of some magical C-Style-String that is not actually a 
C-Style-String, without explicitly stating this.

SQLite3 does handle non-C-Ctyle-Strings.  They are called "blobs".

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.


>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of J Decker
>Sent: Friday, 26 January, 2018 17:18
>To: SQLite mailing list
>Subject: Re: [sqlite] UTF8 and NUL
>
>On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
>peter.dasi...@flightaware.com> wrote:
>
>> On 2018-01-26, at 17:05, J Decker <d3c...@gmail.com> wrote:
>> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
>> > peter.dasi...@flightaware.com> wrote:
>> >> Sqlite uses NUL as the string terminator internally, the
>published API
>> >> specifies has stuff like this all over the place:
>>
>> >>> In those routines that have a fourth argument, its value is the
>number
>> of bytes in the parameter. To be clear: the value is the number of
>bytes in
>> the value, not the number of characters. If the fourth parameter to
>> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the
>length
>> of the string is the number of bytes UP TO THE FIRST ZERO
>TERMINATOR.
>>
>> > You stressed the wrong part there - *IS NEGATIVE*
>>
>> Why? Passing -1 as the length is a common way to tell sqlite3 to
>calculate
>> the length itself. It's a documented and widely used part of the
>API.
>
>
>Exactly, so on neither side, input or output is there a problem
>storing a
>length of valid characters.
>The deficiency is 1) the command line tool for diagnostics
>2) always scanning for a nul in prepare() unless the length is before
>that.  It's simple to add an option that could change that behavior;
>or
>move the string measuring up to prepare[_v2,_v3,_v4] and even add a
>V5 that
>just passes the length passed without a scan.
>
>The input is read by a tokenizer that returns in-buffer references to
>the
>next SQL token by length.
>Some tokens can be quoted, and those end up being a copy of the
>original;
>but the length of the SQL statement should already be known, so it
>doesn't
>need to scan for 0.
>
>Once tokenized it's converted into expressions; those expressions
>(have
>previously) stored only the char*.  It's not a lot of places to
>change to
>include storing the length; which is often known unless the mprintf
>internals are used; then any token passed through that does not pass
>%s.
>So %s cannot be used for UTF8 strings; but rather the literal string
>fwrite( buf, 1, stringlen, <output) ; gets all the standard character
>treatment as the file was opened with (O_BINARY or not, "b" or "t"
>specifiers for fopen, or stderr ).
>
>fprintf( out, "%s", (string) );
>is exactly the same as
>fwrite( out, 1, strlen( string ), string );
>
>(Can anyone dispute that?  I doubt that's specified)
>
>Other than, the fwrite will include outputing the NUL character and
>trust
>the length given to it. \n will still get promoted to \r\n depending
>on
>platform and C library personality.
>
>
>
>
>
>> Therefore:
>>
>
>
>>
>> >> It would be a huge push-up to change this, it would break
>everything,
>> >> including extensions. I don't think it would be possible until
>something
>> >> like sqlite4.
>>
>
>maybe I don't understand what you're saying 'it' is.
>
>
>> ___
>> sqlite-users mailing list
>> sqlite-users@mailinglists.sqlite.org
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-
>users
>>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 3:56 PM, Peter Da Silva <
peter.dasi...@flightaware.com> wrote:

> On 2018-01-26, at 17:05, J Decker  wrote:
> > On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
> > peter.dasi...@flightaware.com> wrote:
> >> Sqlite uses NUL as the string terminator internally, the published API
> >> specifies has stuff like this all over the place:
>
> >>> In those routines that have a fourth argument, its value is the number
> of bytes in the parameter. To be clear: the value is the number of bytes in
> the value, not the number of characters. If the fourth parameter to
> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length
> of the string is the number of bytes UP TO THE FIRST ZERO TERMINATOR.
>
> > You stressed the wrong part there - *IS NEGATIVE*
>
> Why? Passing -1 as the length is a common way to tell sqlite3 to calculate
> the length itself. It's a documented and widely used part of the API.


Exactly, so on neither side, input or output is there a problem storing a
length of valid characters.
The deficiency is 1) the command line tool for diagnostics
2) always scanning for a nul in prepare() unless the length is before
that.  It's simple to add an option that could change that behavior; or
move the string measuring up to prepare[_v2,_v3,_v4] and even add a V5 that
just passes the length passed without a scan.

The input is read by a tokenizer that returns in-buffer references to the
next SQL token by length.
Some tokens can be quoted, and those end up being a copy of the original;
but the length of the SQL statement should already be known, so it doesn't
need to scan for 0.

Once tokenized it's converted into expressions; those expressions (have
previously) stored only the char*.  It's not a lot of places to change to
include storing the length; which is often known unless the mprintf
internals are used; then any token passed through that does not pass %s.
So %s cannot be used for UTF8 strings; but rather the literal string
fwrite( buf, 1, stringlen,  Therefore:
>


>
> >> It would be a huge push-up to change this, it would break everything,
> >> including extensions. I don't think it would be possible until something
> >> like sqlite4.
>

maybe I don't understand what you're saying 'it' is.


> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 2018-01-26, at 17:05, J Decker  wrote:
> On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
> peter.dasi...@flightaware.com> wrote:
>> Sqlite uses NUL as the string terminator internally, the published API
>> specifies has stuff like this all over the place:

>>> In those routines that have a fourth argument, its value is the number of 
>>> bytes in the parameter. To be clear: the value is the number of bytes in 
>>> the value, not the number of characters. If the fourth parameter to 
>>> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length 
>>> of the string is the number of bytes UP TO THE FIRST ZERO TERMINATOR.

> You stressed the wrong part there - *IS NEGATIVE*

Why? Passing -1 as the length is a common way to tell sqlite3 to calculate the 
length itself. It's a documented and widely used part of the API. Therefore:

>> It would be a huge push-up to change this, it would break everything,
>> including extensions. I don't think it would be possible until something
>> like sqlite4.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 1:21 PM, Peter Da Silva <
peter.dasi...@flightaware.com> wrote:

> Sqlite uses NUL as the string terminator internally, the published API
> specifies has stuff like this all over the place:
>
> > In those routines that have a fourth argument, its value is the number
> of bytes in the parameter. To be clear: the value is the number of bytes in
> the value, not the number of characters. If the fourth parameter to
> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length
> of the string is the number of bytes UP TO THE FIRST ZERO TERMINATOR.
>
>
You stressed the wrong part there - *IS NEGATIVE*


> It would be a huge push-up to change this, it would break everything,
> including extensions. I don't think it would be possible until something
> like sqlite4.
>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Simon Slavin
On 26 Jan 2018, at 9:04pm, J Decker  wrote:

> I bet windows command line tools still use it because copy has /B and /A on
> windows 10.

Windows is indeed a problem.  I don't know enough about it to know whether the 
above statement outlines the problem but Windows in general is terrifically 
difficult to troubleshoot characterset problems in.  There are too many 
combinations of user's preferred code-page, application's code-page, keyboard 
set and thereby the characters the keyboard generates, and control panel 
settings for language.

By all means discuss this problem, do testing in the SQLite command-line tool, 
do testing using text files you can hexdump, and other things.  But please 
don't say "Windows acts like this." because what you really mean is "My version 
of Windows acts like this with this App and such-and-such settings.".  If at 
all possible use some other OS which has fewer interacting settings relating to 
characterset.

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
Sqlite uses NUL as the string terminator internally, the published API 
specifies has stuff like this all over the place:

> In those routines that have a fourth argument, its value is the number of 
> bytes in the parameter. To be clear: the value is the number of bytes in the 
> value, not the number of characters. If the fourth parameter to 
> sqlite3_bind_text() or sqlite3_bind_text16() is negative, then the length of 
> the string is the number of bytes UP TO THE FIRST ZERO TERMINATOR.

It would be a huge push-up to change this, it would break everything, including 
extensions. I don't think it would be possible until something like sqlite4.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 11:41 AM, Peter Da Silva <
peter.dasi...@flightaware.com> wrote:

> On 1/26/18, 1:37 PM, "sqlite-users on behalf of J Decker" <
> sqlite-users-boun...@mailinglists.sqlite.org on behalf of d3c...@gmail.com>
> wrote:
> >doesn't get 26 either. 0x1a
>
> 26 isn't EOF, it's SUB (substitute). It was used to represent
> untranslatable characters when converting (for example) EBCDIC to ASCII.
>
> I gave up ever using "rt" or "wt"  because it IS EOF; depending on the
system.
I bet windows command line tools still use it because copy has /B and /A on
windows 10.

(interject, edit:  The effect of */a* depends on its position in the
command-line string. When */a* follows *Source*, *copy* treats the file as
an ASCII file and copies data that precedes the first end-of-file character.
https://en.wikipedia.org/wiki/End-of-file

"In Microsoft's DOS  and Windows
 (and in CP/M
 and many DEC
 operating
systems), reading from the terminal will never produce an EOF. Instead,
programs recognize that the source is a terminal (or other "character
device") and interpret a given reserved character or sequence as an
end-of-file indicator; most commonly this is an *ASCII
 Control-Z
**, code 26. *Some
MS-DOS programs, including parts of the Microsoft MS-DOS shell (COMMAND.COM
) and operating-system utility
programs (such as EDLIN ), treat a
Control-Z in a text file as marking the end of meaningful data, and/or
append a Control-Z to the end when writing a text file. This was done for
two reasons:"


... ASCII  Control-Z
, code 26. 
)

I understand probably 0xFF on punch cards was good because you could just
knock out all the holes and make a correction; and that could be an EOF on
other systems.  unless like O_BINARY was used.

so now we just use files as binary and get the length from the system; and
don't expect any transformations on our data.

--
More on my point though

Sqlite results with values with sqlite3_column_text(stmt,n) and
sqlite3_column_bytes(stmt,n) so any data including NUL from bound or
otherwise values is returned.

strcmp() would have an issue.  Even StrNCmp() and really you need a
comparison that inludes length of both strings.
Strlen is used constantly to find lengths of column, table, and function
names for things that should already be known.  It's not like there's a lot
of copying of those; the net effect is more speec; because it's not even
'strlen' that can be auto-intrinsic-inlined, but a fancy function that
sanitizes the length (sqlite3StrLen()).

The commands for SQL
LENGTH, RTRIM, LTRIM, QUOTE that deal with strings...
MySQL returns bytes for length.  Sqlite returns characters.  and all string
functions work on characters, which means sqltie has to understand UTF8
characters

I wouldn't use any of those functions except in a one-of script because
they are non portable.  But they are non conformant because they do support
a basic way of skipping utf characters 0x9X arbitrarily is also not a
valid UTF8 character (it's a continutation that had no leadin length).
SO that makes even the unicode escapes in the range of 0x9X also available
to encode as bytes, kinda OOB with the data.


Invalid characters (overlong and otherwise) should be replaced with FFFD
http://unicode.org/pipermail/unicode/2017-May/005522.html (from this
thread, sort of; was on ill formed utf-8, really the past of this thread
but didn't find it)
https://www.fileformat.info/info/unicode/char/fffd/index.htm
Comments used to replace an incoming character whose value is unknown or
unrepresentable in Unicode
compare the use of U+001A
 as a control
character to indicate the substitute function

(I would have said, "0xFEFF ? ZWNBSP zero width non breaking space(?) EF BB
BF "  but went and searched and found it was different than I thought )
A quick note about UTF8; every byte has one bit off.

My initial impression was that it shouldn't care, being basically a smart
storage engine; since what I put in I could get back out.  Having patched
the input side to escape ' and NUL in string values, I don't need my larger
patch.
But then having looked through so much of the string handling, the overall
effect is still positive.

Then there's internal logging and analysis, which should also escape the
output for strings, there IS a SQL way to include char(0).
can't really change sqlite3_column_text at this point; which means no
matter how much it is enforced and made harder to not count 0 as a
character, it doesn't 

Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 1/26/18, 2:34 PM, "sqlite-users on behalf of J. King" 
 
wrote:
> Do you have a point in making either statement? If you do, I'm really not 
> seeing it.

The point is that apart from CP/M and derivatives like DOS, this kind of 
behavior is strictly a leftover from the '60s. And CP/M only had this 
restriction because it was tremendously resource-constrained. It's not a 
precedent for treating some magic character as an end-of-file marker when 
virtually every operating released since 1970 system (apart from a couple that 
derived from this historical anomaly) has had files with byte-precise size 
metadata.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J. King
On 2018-01-26 15:13:46, "Peter Da Silva"  
wrote:


On 1/26/18, 2:11 PM, "sqlite-users on behalf of John McKown" 
 wrote:
​In the distant past (CP/M-80), the filesystem meta data did not 
include the actual _length_ of the data for a text data file.


Since DOS wasn't an OS, then CP/M certainly wasn't.


Do you have a point in making either statement? If you do, I'm really 
not seeing it.


--
J. King

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 1/26/18, 2:11 PM, "sqlite-users on behalf of John McKown" 
 wrote:
> ​In the distant past (CP/M-80), the filesystem meta data did not include the 
> actual _length_ of the data for a text data file.

Since DOS wasn't an OS, then CP/M certainly wasn't.
 

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread John McKown
On Fri, Jan 26, 2018 at 1:41 PM, Peter Da Silva <
peter.dasi...@flightaware.com> wr

> On 1/26/18, 1:37 PM, "sqlite-users on behalf of J Decker" <
> sqlite-users-boun...@mailinglists.sqlite.org on behalf of d3c...@gmail.com>
> wrote:
> >doesn't get 26 either. 0x1a
>
> 26 isn't EOF, it's SUB (substitute). It was used to represent
> untranslatable characters when converting (for example) EBCDIC to ASCII.
>

​In the distant past (CP/M-80), the filesystem meta data did not include
the actual _length_ of the data for a text data file. The I/O was done in
sectors. The CP/M-80 system, by convention, used 0x1A (26) and an "logical
EOF" indication and the C routines would detect it and report EOF.​ MS-DOS
basically didthe same thing, for compatibility reasons. I am not sure, but
I think that Windows still does this. A quick test with the command "type
x.txt" where "x.txt" contained "abc~def" (where ~ is standing in for 0x1a)
resulted in my seeing "abc". But "notepad x.txt" shows "abc def". So I
guess it depends on how old the Windows app is.


-- 
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 1/26/18, 1:37 PM, "sqlite-users on behalf of J Decker" 
 
wrote:
>doesn't get 26 either. 0x1a

26 isn't EOF, it's SUB (substitute). It was used to represent untranslatable 
characters when converting (for example) EBCDIC to ASCII.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 10:44 AM, Peter Da Silva <
peter.dasi...@flightaware.com> wrote:

> On 1/26/18, 12:40 PM, "sqlite-users on behalf of J Decker" <
> sqlite-users-boun...@mailinglists.sqlite.org on behalf of d3c...@gmail.com>
> wrote:
> >  reads the bytes and does things with them.  the EOF would get returned
> with fgetc() but not the character.
>
> Fgetc returns an int, not a byte. That EOF is -1, not 0xFF.
>
doesn't get 26 either. 0x1a

>
>
>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 1/26/18, 12:40 PM, "sqlite-users on behalf of J Decker" 
 
wrote:
>  reads the bytes and does things with them.  the EOF would get returned with 
> fgetc() but not the character.

Fgetc returns an int, not a byte. That EOF is -1, not 0xFF.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 10:35 AM, Tim Streater  wrote:

> On 26 Jan 2018, at 18:12, Keith Medcalf  wrote:
>
> > Actually, EOF (0xFF) *is* part of a text file, and is the byte in an
> ASCII
> > byte-stream that indicates end-of-file.
>
> First I've heard of that. Which systems did that then? EOF is normally
> indicated by the file system, not by file data.
>
> the 't' part of fopen( "xxx", "rt" );

reads the bytes and does things with them.  the EOF would get returned with
fgetc() but not the character.

>
> --
> Cheers  --  Tim
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Tim Streater
On 26 Jan 2018, at 18:12, Keith Medcalf  wrote:

> Actually, EOF (0xFF) *is* part of a text file, and is the byte in an ASCII
> byte-stream that indicates end-of-file.

First I've heard of that. Which systems did that then? EOF is normally 
indicated by the file system, not by file data.


-- 
Cheers  --  Tim
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 1/26/18, 12:31 PM, "sqlite-users on behalf of J Decker" 
 
wrote:
> ctrl-z was end of file text character in DOS (wrote char 26; not FF)

DOS wasn't an operating system.
 

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 10:22 AM, Peter Da Silva <
peter.dasi...@flightaware.com> wrote:

> On 1/26/18, 12:12 PM, "sqlite-users on behalf of Keith Medcalf" <
> sqlite-users-boun...@mailinglists.sqlite.org on behalf of
> kmedc...@dessus.com> wrote:
> > Actually, EOF (0xFF) *is* part of a text file, and is the byte in an
> ASCII byte-stream that indicates end-of-file.  In the "old days" the bytes
> following the last-byte in a stream and the end of a storage block
> (sector/cluster/track/cylinder, what have you) were padded with 0xFF so
> you knew you were past the end-of-the-file when you were reading it.
>
> Oh, I remember the messes that existed before stream files became the
> norm. But messes they were, and there's no more reason to support them in a
> Unicode file than there is to support FIELDDATA format.
>
> And if you're going to talk about the block file and paper tape era, don't
> forget that FF also meant a deleted character and should be skipped without
> being counted or accounted for.
>
>
ctrl-z was end of file text character in DOS (wrote char 26; not FF)
EOF is returned as -1 not 0xFF (although signed char looks really similar)
the character 0xFF is 0xC3 0xBF nof 0xFF.



> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 1/26/18, 12:12 PM, "sqlite-users on behalf of Keith Medcalf" 
 
wrote:
> Actually, EOF (0xFF) *is* part of a text file, and is the byte in an ASCII 
> byte-stream that indicates end-of-file.  In the "old days" the bytes 
> following the last-byte in a stream and the end of a storage block 
> (sector/cluster/track/cylinder, what have you) were padded with 0xFF so you 
> knew you were past the end-of-the-file when you were reading it.

Oh, I remember the messes that existed before stream files became the norm. But 
messes they were, and there's no more reason to support them in a Unicode file 
than there is to support FIELDDATA format.

And if you're going to talk about the block file and paper tape era, don't 
forget that FF also meant a deleted character and should be skipped without 
being counted or accounted for.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Keith Medcalf

Actually, EOF (0xFF) *is* part of a text file, and is the byte in an ASCII 
byte-stream that indicates end-of-file.  In the "old days" the bytes following 
the last-byte in a stream and the end of a storage block 
(sector/cluster/track/cylinder, what have you) were padded with 0xFF so you 
knew you were past the end-of-the-file when you were reading it.

Just "more modern" Operating Systems are capable of setting the file length 
more accurately than in the past. And "stream" processors now recognize 
"running out of data" as EOF.  Just because it is now thus does not mean it was 
always so. (And, of course, just because a "stream" has no more data to return 
does not necessarily mean that it is at end-of-file, merely that there is no 
more data to return *at the moment* -- perhaps the card reader is jammed or the 
paper-tape broke :) ).

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of Peter Da Silva
>Sent: Friday, 26 January, 2018 07:30
>To: SQLite mailing list
>Subject: Re: [sqlite] UTF8 and NUL
>
>On 1/26/18, 8:24 AM, "sqlite-users on behalf of Gary R. Schmidt"
><sqlite-users-boun...@mailinglists.sqlite.org on behalf of
>g...@mcleod-schmidt.id.au> wrote:
>> But how would you differentiate EOF???  (Let me guess, 0.  :-) )
>
>End of file is not part of the contents of the file or a string. It's
>metadata.
>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
On Fri, Jan 26, 2018 at 5:55 AM, Peter Da Silva <
peter.dasi...@flightaware.com> wrote:

> What is the goal of this discussion? Changing the string terminator SQLite
> uses? I think it's almost 50 years too late for that, but I'm sure that if
> Unicode and UTF8 had been a thing in 1970 then C would have selected FF as
> the string terminator.
>
> There's so much resistence to handling NUL in command line tools, test and
in the engine itself, I figured there must be a reason; maybe the
Authentication/Encryption that has been added to sqlite by sqlite people
stores meta data after field content; Such content could still be kept, and
isolated from users with an alternative string terminator;  since that
character is never returned to the user it doesn't matter what sqlite uses
internally (other than having previously used something else)  Is probably
a change that is 3.x to 4.x significant though.


> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
On 1/26/18, 8:24 AM, "sqlite-users on behalf of Gary R. Schmidt" 
 wrote:
> But how would you differentiate EOF???  (Let me guess, 0.  :-) )

End of file is not part of the contents of the file or a string. It's metadata. 

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Gary R. Schmidt

On 27/01/2018 00:55, Peter Da Silva wrote:

What is the goal of this discussion? Changing the string terminator SQLite 
uses? I think it's almost 50 years too late for that, but I'm sure that if 
Unicode and UTF8 had been a thing in 1970 then C would have selected FF as the 
string terminator.
But how would you differentiate EOF???  (Let me guess, 0.  :-) )


Cheers,
GaryB-)
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Peter Da Silva
What is the goal of this discussion? Changing the string terminator SQLite 
uses? I think it's almost 50 years too late for that, but I'm sure that if 
Unicode and UTF8 had been a thing in 1970 then C would have selected FF as the 
string terminator. 

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread Clemens Ladisch
J Decker wrote:
> U+009C 156 String Terminator ST

"ST is used as the closing delimiter of a control string opened by
APPLICATION PROGRAM COMMAND (APC), DEVICE CONTROL STRING (DCS),
OPERATING SYSTEM COMMAND (OSC), PRIVACY MESSAGE (PM), or START OF
STRING (SOS)."


Regards,
Clemens
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] UTF8 and NUL

2018-01-26 Thread J Decker
https://en.wikipedia.org/wiki/List_of_Unicode_characters#Control_codes
Even the Control codes within unicode aren't FF.

U+009C 156 String Terminator ST
literal bytes \xC2\x9c  are string terminator ... Was thinking that like
APC and ST were higher than that... more in the range of 0xF8-0xFF



On Thu, Jan 25, 2018 at 7:57 PM, J Decker  wrote:

> NUL is a valid utf8 character
> but FF is never valid.  (would be like a 36 bit length specification)
> and practically anthing more than F8 is invalid utf8 character.
> Other than BOM
> https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
> EF BB BF 239 187 191
>
> // EF - 80 | 3b - 80 | 3f
> ( 0xfeff  )
>
>
> Many Windows  programs
> (including Windows Notepad
> ) add the bytes 0xEF,
> 0xBB, 0xBF at the start of any document saved as UTF-8. Th
>
> (Not that BOM is even required, because, it's already ordered bytes)
> --
> But anYway FF could be used as a string terminator instead of 00.  It is
> never legal in any utf-8 sequence.
> (F8,F9,FA,FB,FC,FD,FE,FF)
> F8 would be a 5 byte encoding, but that is more code points than unicode
> has allocated.  It could be potentially useful to permit a little extra
> space in sequences , so I would avoid F8(F9,FA,FB) and stick to FC-FF for
> possible control characters.
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users