Re: [sqlite] Unexplained table bloat

2020-01-13 Thread James K. Lowden
On Sun, 12 Jan 2020 16:24:53 -0700
"Keith Medcalf"  wrote:

> while one may be tempted to claim that "consistency is the hobgoblin
> of little minds" 

You might have forgotten that the much overpraised Ralph Waldo specified
"a foolish consistency".  He only meant: don't try to hang your hat on
too small a peg.  

> Doing this does not really do much since you still have to check the
> type on retrieval of the value anyway in order to know what to do
> with it.

That depends what "much" is.  

The value of constraints used to enforce types is to reject from the
database values outside the domain.  That not only simplifies
application logic, but also the logical consistency of the queries
themselves.  

If "year" is always an integer -- never NULL, never a string -- then 

avg(year) 
and
count(year) 

are always correct.  But if the database contains for "year" a string
like "it was a good one", or NULLs, they're both unreliable.  

--jkl
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread James K. Lowden
On Sun, 12 Jan 2020 15:48:05 -0700
"Keith Medcalf"  wrote:

> SQL is simply a Structured Query Language that can be overlaid on a
> variety of datastore models, one of which is the Relational model.  

I think that's misleading, Keith, for a couple of reasons. 

SQL was invented by IBM expressly and specfically for its "relational"
product.  Its wordiness is the product of its roots in "4th generation"
languages in vogue at the time, the promise of which was to permit
users to "write their own reports".  Hence the strict select-from-where
syntax, meant to be so simple that untrained users could figure it
out.  

While its true, as you say, that many pre-relational systems (and
post-relational ones, later, feh) added SQL on their shingles, those
were never coherent implementations.  They were a way to say Yes! when
the question was, "Do you support SQL?"

> Many (most in fact) datastores require that all instances of the same
> "column" in an "entity" be the same type

To the extent that SQL implements relational algebra/calculus, its
utility and consistency *requires* that each column be of a defined
type.  For example, if the query includes, 

where A between 0 and 1

and A is a *not* a numeric value, then the query is reduced from
rigorous first-order predicate logic to gibberish.  

SQLite serves a particular niche very well.  The choice not to enforce
type constraints for declared datatypes biases the system toward
ease-of-insertion.  That has advantages in the SQLite problem domain,
but the trade-off comes at a price: it makes SQLite *harder* to use in a
rigorous way.  

--jkl
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Keith Medcalf

On Monday, 13 January, 2020 06:36, Dominique Devienne  
wrote:

> Please remind me, is the encoding a "client-side" setting, or also a
> "server-side" (i.e. stored) setting?

"pragma encoding " sets the internal storage format for text encoding in a 
database (on the "inside SQLite" side of the API demarcation).  It is 
persistent (a database attribute) and cannot be changed once the database has 
been created.

The boundary between "internal" and "external" is the SQLite3 API.  Thus you 
send "external" text into SQLite using the sqlite3_bind_text* 
sqlite3_result_text* interfaces.  The call specifies the format you are using 
for the "external" representation.  When the text gets "internal" (to the other 
side of the API) it is stored (and converted if necessary to) the "pragma 
encoding" setting of the database.  You retrieve the "internal" values into 
your external program using sqlite3_column_text* and sqlite3_value_text* 
interfaces, which converts (if necessary) the data from the "internal" encoding 
to the external encoding you requested.

https://sqlite.org/pragma.html#pragma_encoding

The length() function is an external function (as in on the "exernal" side of 
the demarcation).  It retrieves its arguments using the "sqlite3_value" 
interfaces from the internal storage format into the requested external storage 
format, does its thing, and then returns an external result back into the 
"internal" part of SQLite using the sqlite3_result* interface.

Contrast this with a function such as cast(...) which is strictly internal.  It 
operates entirely internally to SQLite directly on the internal data and does 
not retrieve the "internal data" into an external format then send the 
converted external data back into SQLite.

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.




___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread David Raymond
Well, I believe this is the relevant bit from the docs for binding:
https://www.sqlite.org/c3ref/bind_blob.html

"If a non-negative fourth parameter is provided to sqlite3_bind_text() or 
sqlite3_bind_text16() or sqlite3_bind_text64() then that parameter must be the 
byte offset where the NUL terminator would occur assuming the string were NUL 
terminated. If any NUL characters occur at byte offsets less than the value of 
the fourth parameter then the resulting string value will contain embedded 
NULs. The result of expressions involving strings with embedded NULs is 
undefined. "


-Original Message-
From: sqlite-users  On Behalf Of 
Barry Smith
Sent: Monday, January 13, 2020 1:54 PM
To: SQLite mailing list 
Subject: Re: [sqlite] Unexplained table bloat

On the original topic...

How does one end up with a database in this state? I.e with a binary value that 
contains 0x00 bytes followed by other bytes but a type of TEXT?

If the definition of a text string in SQLite is that it ends at the first 0x00 
byte, then it seems that anything stored as a text string should adhere to that.

So a database with a TEXT value that contains characters after the first 0x00 
should be considered corrupt. Given that to retrieve the actual contents of the 
cell it must be cast to BLOB, why not force the storage of any string that 
contains 0x00 as a BLOB in the first place?

What am I missing here?

On 13 Jan 2020, at 6:02 am, Simon Slavin  wrote:
> 
> On 13 Jan 2020, at 9:26am, Dominique Devienne  wrote:
> 
>> Which implies length(text_val) is O(N), while
>> length(blob_val) is O(1),
>> something I never quite realized.
> 
> For this reason, and others discussed downthread, some languages which store 
> Unicode strings store the number of graphemes as well as its contents.  So 
> functions which care about the … let's call it "width" … just retrieve that 
> number rather than having to parse the string to figure out the length.
> 
> In a Unicode string 'length' can mean
> 
> 1) octet count (number of 8-bit bytes used to store the string)
> 2) number of code points (basic unicode unit)
> 3) number of code units (how code points get arranged in UTF8, UTF16, etc., 
> not as simple as it looks)
> 4) length in graphemes (space-using units)
> 5) length in glyphs (font-rendering units)
> 
> and probably others I've forgotten.  Not to mention that I simplified the 
> definitions of the above and may have got them wrong.
> 
> An application centred around rendering text (e.g. vector graphics drawing 
> apps) might have each piece of text stored with all five of those numbers, 
> just to save it from having to constantly recalculate them.
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Barry Smith
On the original topic...

How does one end up with a database in this state? I.e with a binary value that 
contains 0x00 bytes followed by other bytes but a type of TEXT?

If the definition of a text string in SQLite is that it ends at the first 0x00 
byte, then it seems that anything stored as a text string should adhere to that.

So a database with a TEXT value that contains characters after the first 0x00 
should be considered corrupt. Given that to retrieve the actual contents of the 
cell it must be cast to BLOB, why not force the storage of any string that 
contains 0x00 as a BLOB in the first place?

What am I missing here?

On 13 Jan 2020, at 6:02 am, Simon Slavin  wrote:
> 
> On 13 Jan 2020, at 9:26am, Dominique Devienne  wrote:
> 
>> Which implies length(text_val) is O(N), while
>> length(blob_val) is O(1),
>> something I never quite realized.
> 
> For this reason, and others discussed downthread, some languages which store 
> Unicode strings store the number of graphemes as well as its contents.  So 
> functions which care about the … let's call it "width" … just retrieve that 
> number rather than having to parse the string to figure out the length.
> 
> In a Unicode string 'length' can mean
> 
> 1) octet count (number of 8-bit bytes used to store the string)
> 2) number of code points (basic unicode unit)
> 3) number of code units (how code points get arranged in UTF8, UTF16, etc., 
> not as simple as it looks)
> 4) length in graphemes (space-using units)
> 5) length in glyphs (font-rendering units)
> 
> and probably others I've forgotten.  Not to mention that I simplified the 
> definitions of the above and may have got them wrong.
> 
> An application centred around rendering text (e.g. vector graphics drawing 
> apps) might have each piece of text stored with all five of those numbers, 
> just to save it from having to constantly recalculate them.
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Simon Slavin
On 13 Jan 2020, at 9:26am, Dominique Devienne  wrote:

> Which implies length(text_val) is O(N), while
> length(blob_val) is O(1),
> something I never quite realized.

For this reason, and others discussed downthread, some languages which store 
Unicode strings store the number of graphemes as well as its contents.  So 
functions which care about the … let's call it "width" … just retrieve that 
number rather than having to parse the string to figure out the length.

In a Unicode string 'length' can mean

1) octet count (number of 8-bit bytes used to store the string)
2) number of code points (basic unicode unit)
3) number of code units (how code points get arranged in UTF8, UTF16, etc., not 
as simple as it looks)
4) length in graphemes (space-using units)
5) length in glyphs (font-rendering units)

and probably others I've forgotten.  Not to mention that I simplified the 
definitions of the above and may have got them wrong.

An application centred around rendering text (e.g. vector graphics drawing 
apps) might have each piece of text stored with all five of those numbers, just 
to save it from having to constantly recalculate them.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Dominique Devienne
On Mon, Jan 13, 2020 at 1:10 PM Keith Medcalf  wrote:
> If the register object contains "text" and you cast it to a blob (remove the 
> text affinity) you are left with just the bag-o-bytes, and length() will 
> return the size of the bag encoded in the register.  If the data in the 
> register is other than type "text" then it must be converted to text first 
> (in the database encoding) and then the cast will remove the text affinity, 
> after which the value returned by the length() function will be the number of 
> bytes in the bag that holds that text representation:
>
> sqlite> pragma encoding='utf-16';
> sqlite> create table x(x);
> sqlite> insert into x values ('text' || char(0) || 'text');
> sqlite> select x, typeof(x), length(x), length(cast(x as blob)) from x;
> text|text|4|18

Please remind me, is the encoding a "client-side" setting, or also a
"server-side" (i.e. stored) setting?

I wasn't sure whether pragma encoding='utf-16' affected the stored
state as well, or whether it was always in UTF-8
and SQLite was doing conversion on the fly for the client requested
encoding. I thought of lengthof() as the size stored
in the value header itself, which I assumed was always in bytes.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Richard Damon

On 1/13/20 5:24 AM, Dominique Devienne wrote:

On Mon, Jan 13, 2020 at 11:07 AM Keith Medcalf  wrote:

On Monday, 13 January, 2020 02:27, Dominique Devienne  
wrote:

I'd vote for a lengthof(col) that's always O(1) for both text and blob

So what should lengthof(something) return the number of bytes in the 'database 
encoding' or something else?

Bytes of course. Of the data stored, i.e. excluding the header byte
and encoded size (if any) from the file-format.
Basically the same as length() *except* for text values, resulting in
O(1) behavior. --DD

PS: I keep forgetting length(text_val) returns the number of
code-points in fact :)
PPS: Surrogate pairs count as one or two code points? That's just
bait, I don't really want to know :)))
Re the PPS, UTF-8 isn't allowed to have Surrogate Pairs. Non-BMP 
characters which would use Surrogate Pairs in UTF-16 are supposed to be 
converted to their fundamental 21 bit value and that encoded into UTF-8. 
If the code doesn't validate the data well enough to catch that issue, 
then I suspect the character counting would count each half of the 
surrogate pairs as a code-point,


--
Richard Damon

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Keith Medcalf

If the register object contains "text" and you cast it to a blob (remove the 
text affinity) you are left with just the bag-o-bytes, and length() will return 
the size of the bag encoded in the register.  If the data in the register is 
other than type "text" then it must be converted to text first (in the database 
encoding) and then the cast will remove the text affinity, after which the 
value returned by the length() function will be the number of bytes in the bag 
that holds that text representation:

SQLite version 3.31.0 2020-01-12 23:30:01
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> pragma encoding;
UTF-8
sqlite> create table x(x);
sqlite> insert into x values ('text' || char(0) || 'text');
sqlite> insert into x values (3.14159);
sqlite> select x, typeof(x), length(x), length(cast(x as blob)) from x;
text|text|4|9
3.14159|real|7|7
sqlite> .q

SQLite version 3.31.0 2020-01-12 23:30:01
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> pragma encoding='utf-16';
sqlite> pragma encoding;
UTF-16le
sqlite> create table x(x);
sqlite> insert into x values ('text' || char(0) || 'text');
sqlite> insert into x values (3.14159);
sqlite> select x, typeof(x), length(x), length(cast(x as blob)) from x;
text|text|4|18
3.14159|real|7|14
sqlite> .q

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread R Smith

On 2020/01/13 12:24 PM, Dominique Devienne wrote:
Bytes of course. Of the data stored, i.e. excluding the header byte 


I checked, I was apparently correct about the casting. This following 
extract from a DB I got from a forum member with Greek-to-Danish 
translations where I added length checks:


  -- SQLite version 3.30.1  [ Release: 2019-10-10 ]  on SQLitespeed 
version 2.1.3.11.
  -- 



SELECT Glose_ID, Græsk, Dansk, length(Dansk) AS len, length(CAST(Dansk 
AS BLOB)) AS bytelen

  FROM Gloser
 WHERE 1 LIMIT 20;


  --   Glose_ID   | Græsk  | 
Dansk    | len | bytelen
  --  | -- | 
 | --- | ---
  --   1  | απόδειξη η | 1. bevis  2. 
kvittering  |  23 |    23
  --   2  | δεικτική αντωνυμία | påpegende stedord, 
demonstrativ pronomen |  40 | 41
  --   3  | δεικτικός-ή-ό  | 
påpegende    | 9  |    10
  --   4  | καθόλου    | slet (ikke), overhovedet 
(ikke)  |  31 |    31
  --   5  | κώδικας ο  | 1. kode 2. 
lov   |  14 |    14
  --   6  | πλην   | (mat) minus; (præp + G) 
undtagen |  32 |    33
  --   7  | προσδιορίζω    | præcisere, fastsætte, bestemme 
(nøjagtigt)   |  42 |    45
  --   8  | προσδιορισμός ο    | fastsættelse, (nøjagtig) 
bestemmelse |  36 |    38
  --   9  | προσδιορισής ο | ?? (- RH) foremntlig samme 
betydning som προσδιορισμός   |  54 | 67
  --  10  | φακός ο    | 1. linse  2. glas, brilleglas  
3. lygte, lommelygte  |  51 |    51
  --  11  | βρώμικος-η-ο   | snavset, beskidt; (fig) 
lyssky, skummel  |  39 | 39
  --  12  | δε μου λες...  | sig mig engang… (egl “du 
fortæller mig ikke”)    |  45 |    52
  --  13  | δημοσιογράφος ο/η  | 
journalist   | 10 |    10
  --  14  | κέρμα το   | mønt, 
småmønt    |  13 |    16
  --  15  | κοπέλα η   | 
pige | 4  |    4
  --  16  | μαθήτρια η | elev, skoleelev 
(kvindelig)  |  27 |    27
  --  17  | μαθητής ο  | elev, skoleelev 
(mandlig)    |  25 |    25
  --  18  | μπουκάλι το    | 
flaske   | 6  |    6
  --  20  | ποτήρι το  | glas (Både glas og flaske er 
intetkøn som vin. Øl er hunkøn) |  60 |    64
  --  21  | ίσιος-α-ο  | 
lige | 4  |    4



Hope that format is not too messed up by my e-mailer.


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread R Smith


On 2020/01/13 12:24 PM, Dominique Devienne wrote:

On Mon, Jan 13, 2020 at 11:07 AM Keith Medcalf  wrote:

On Monday, 13 January, 2020 02:27, Dominique Devienne  
wrote:

I'd vote for a lengthof(col) that's always O(1) for both text and blob

So what should lengthof(something) return the number of bytes in the 'database 
encoding' or something else?

Bytes of course. Of the data stored, i.e. excluding the header byte
and encoded size (if any) from the file-format.
Basically the same as length() *except* for text values, resulting in
O(1) behavior. --DD



Please have mercy - my beautiful text-extracty queries will no longer work!

I'm OK with a second other function, perhaps called bytelength() that 
returns what you suggest, but length is mostly used in any text 
extracting or parsing along with with other functions like instr() and 
substr() etc. which all take character indexes based on positions of 
code-points and not of actual bytes.


Back in the day when all text was ANSI ASCII 8-byte Latin-English 
characters this was easy, but now it ain't.


That said, I thought that if you cast a string to BLOB and then query 
the length, like: SELECT length(CAST(firstname AS BLOB)) AS bytelen;
You essentially get what you asked for, but I've never used it, so am 
not sure.





___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Dominique Devienne
On Mon, Jan 13, 2020 at 11:07 AM Keith Medcalf  wrote:
> On Monday, 13 January, 2020 02:27, Dominique Devienne  
> wrote:
> >> I'd vote for a lengthof(col) that's always O(1) for both text and blob
>
> So what should lengthof(something) return the number of bytes in the 
> 'database encoding' or something else?

Bytes of course. Of the data stored, i.e. excluding the header byte
and encoded size (if any) from the file-format.
Basically the same as length() *except* for text values, resulting in
O(1) behavior. --DD

PS: I keep forgetting length(text_val) returns the number of
code-points in fact :)
PPS: Surrogate pairs count as one or two code points? That's just
bait, I don't really want to know :)))
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Keith Medcalf

On Monday, 13 January, 2020 02:27, Dominique Devienne  
wrote:

>On Fri, Jan 10, 2020 at 7:03 PM Richard Hipp  wrote:

>> On 1/10/20, Dominique Devienne  wrote:

>>> There's no way at all, to know the length of a text column with
>>> embedded NULLs?

>> You can find the true length of a string in bytes from C-code using
>> the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
>> of a way to do that from SQL.

> That's what I thought. Which implies length(text_val) is O(N), while
> length(blob_val) is O(1), something I never quite realized. And this 
> despite both storing the length the same way at the file-format level. 
> That's kind of a gotcha, which might be worth documenting.

No, they are measured in different units.  

length(something) where something is NOT a text value returns the length in 
BYTES of that something.  For a BLOB this is the number of bytes according to 
the length data stored in the database.  For NULL, INTEGER, or REAL values, it 
is the number of BYTES in the ASCII text representation of that thing (which 
also happens to be the number of characters since each ASCII character takes 
one byte).

length(something) where something IS a text value returns the number of 
codepoints in the UTF-8 representation of that text and counting stops at the 
zero terminator (since the definition of "text" is a C string, all text values 
have an appended 0 terminator -- if there is an embedded 0 character, that 
terminates the counting).

>> I'd vote for a lengthof(col) that's always O(1) for both text and blob
>> values, although I'm not holding my breath for it, as I suspect it's 
>> unlikely to be added, given its relative low value.

So what should lengthof(something) return the number of bytes in the 'database 
encoding' or something else?

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-13 Thread Dominique Devienne
On Fri, Jan 10, 2020 at 7:03 PM Richard Hipp  wrote:

> On 1/10/20, Dominique Devienne  wrote:
> > There's no way at all, to know the length of a text column with embedded 
> > NULLs?
>
> You can find the true length of a string in bytes from C-code using
> the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
> of a way to do that from SQL.

That's what I thought. Which implies length(text_val) is O(N), while
length(blob_val) is O(1),
something I never quite realized. And this despite both storing the
length the same way at the
file-format level. That's kind of a gotcha, which might be worth documenting.

I'd vote for a lengthof(col) that's always O(1) for both text and blob
values, although I'm not
holding my breath for it, as I suspect it's unlikely to be added,
given its relative low value.

I don't disagree with Keith that text shouldn't contain embedded
NULLs, and that C-String are
by design and definition that way, I was more thinking of the C++ use
case of having an std::string
with embedded NULLs, which is perfectly OK and common enough, and
using a C++ wrapper for
SQLite (which typically uses overloading for binding for example),
which will insert a text value for that
case, using .c_str() + .length() (or .data() + .size(), doesn't
matter, ends up the same), leading to the
very issue that started this thread (just a guess).

The inability to correctly size a value in SQL (and thus a column, via
a sum() group by)
for text with embedded nulls is quite unfortunate. And the fact
length(text_col) is also O(N)
is similarly unfortunate. Thus the above idea of an O(1)
lengthof(col), as a companion to typeof(col).

Thanks, --DD
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread R Smith

On 2020/01/13 12:25 AM, Tom Browder wrote:

On Sun, Jan 12, 2020 at 14:05 Keith Medcalf  wrote:

Close, but no banana. Every value has a type. A column may contain 
multiple values (as in one per row) 


Thanks, Keith.

I assume that is just for SQLite, or am I wrong again?



You are not wrong. This comes up from time to time and is always quite 
interesting.



As Keith explained, the Relational Model can be applied on most kinds of 
data-stores. SQLite's data store happens to allow per-value typing, and 
then only using "Duck typing" (https://en.wikipedia.org/wiki/Duck_typing).


While on the subject of SQLite peculiarities, there's a few to note if 
you are new to SQLite, like:
- You don't need any type for a column, but untyped columns are treated 
as having blob affinity, not text.
- A column declared as VARCHAR(30), or TEXT(30) or INT(11) for that 
matter, will be fully accepted, but there is no actual length 
constraint. You can put any length value in the column.
- NULL values are distinct from each other (in some cases) so that if A 
is NULL and B is NULL, then A = B returns FALSE (0) and A <> B also 
returns FALSE!. This is useful, but note that a Primary Key in SQLite 
without also having the NOT NULL constraint, will allow duplicate NULL 
values in the key. (the row-id alias being the exception).
- While on the point, the special type "INTEGER PRIMARY KEY" is an Alias 
for the internal index (row-id) of the table, and so doesn't allow NULLs.
- While on it still... Tables don't always have row-ids, and there is no 
easy check to know if it does, so unless you made the DB yourself, you 
can't safely just query the row-id.
- Since typing is per value, it is not an error in SQLite for Parent and 
Child Key columns in a Foreign-Key relationship to have different types 
- this can have some unexpected results!
- A datetime is a Numeric type expressed as text (ISO8601) and doesn't 
inherently know anything about time or time-zones. (The date-handling 
functions work amazingly well though).

- A column with the type STRING will have Integer affinity.
- A spelling error in your schema, such as CREATE TABLE t(id INT, name 
T3XT); will not be an error - SQLite will silently regard that column 
affinity as blob.
- Quotes are more or less ambivalent... CREATE TABLE t("id" int, [name] 
text, `age` [int]) is a perfectly valid schema statement, and Set a = 
"John" can mean different things based on whether there is a column 
named John or not.


I'm probably not remembering all of them now, but we've spent lots of 
time musing about it before (forum searches will probably provide a host 
of discussions, including much lobbying for a "strict" mode) and because 
of all that, there's a fun feature added in SQLitespeed that does 
Schema-checking and prints a list of warnings if it contains one or more 
of these SQLite quirks (including misspelled types, which helps me a 
lot), and on the SQLite site there's also a section on some of these 
peculiarities (https://www.sqlite.org/quirks.html).


I'm hoping someone else will add the quirks which I forgot about :)

We all came to love (mostly) and often use these quirks to some 
advantage, but it pays to be aware of them, especially coming from 
another DB architecture where typing and the like are more rigid.



Cheers!
Ryan



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Keith Medcalf

On Sunday, 12 January, 2020 15:31, Simon Slavin  wrote:

>You're generally right.  SQLite always uses affinities (more or less
>'weak typing') rather than strong typing.  I don't know of any other SQL
>implementations which allow this without explicit declaration, and most
>don't allow it at all.

SQLite is strongly typed however "entities" (table rows) are composed of a 
fixed number of dynamically typed values.  A collection of entities (a table) 
has an affinity (preference for a particular type if possible) for each 
"column" in the collection.

This is not "weak typing" nor is it "duck typing".  It is "dynamic typing".

>You can enforce strong typing in SQLite using a constraint.  But most
>people enforce type in their own code, before the value gets to SQLite.

Well, no.  You can restrict the size of the dynamic though a check constraint, 
but you cannot make the members of entities strongly typed.  For example:

create table x
(
  c text check (typeof(c) is 'text')
);

does not strongly type the column "c" of entity collection "x".  What it does 
is ensure that only text values are stored in that particular dynamically typed 
column of the entity.  You will note that the declaration is inconsistent -- 
the two available correct forms would be:

create table x
(
  c text not null check (typeof(c) is 'text')
);

and

create table x
(
  c text check (typeof(c) in ('null', 'text'))
);

while one may be tempted to claim that "consistency is the hobgoblin of little 
minds" its converse, inconsistency, is often a source of errors.

Some affinities are more complicated.  For example:

create table x
(
  c numeric check (typeof(c) in ('null', 'real', 'integer'))
);

both 'real' and 'integer' are required for numeric affinity enforcement, and 
the 'null' because the column may contain nulls.

Doing this does not really do much since you still have to check the type on 
retrieval of the value anyway in order to know what to do with it.

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Keith Medcalf

On Sunday, 12 January, 2020 15:29, Richard Damon  
wrote:

>On 1/12/20 5:25 PM, Tom Browder wrote:

>> On Sun, Jan 12, 2020 at 14:05 Keith Medcalf  wrote:

>>> On Sunday, 12 January, 2020 09:03, Tom Browder  
>>> wrote:

 Am I missing something? I thought every column has to have a type?

>>> Close, but no banana.  Every value has a type.  A column may contain
>>> multiple values (as in one per row)

>> I assume that is just for SQLite, or am I wrong again?

> That the entries for a given column in different rows can have different
> types is a peculiarity of SQLite. In a 'Standard' SQL database, a column
> has a defined type, and all rows will have values of that type (or NULL).

It is a peculiarity of the underlying datastore used by SQLite.  Many (most in 
fact) datastores require that all instances of the same "column" in an "entity" 
be the same type -- some do not (SQLite is one of them).  SQL is simply a 
Structured Query Language that can be overlaid on a variety of datastore 
models, one of which is the Relational model.  When SQL is used on other 
database organizations it is entirely possible for the "type" of a particular 
returned "column" to vary by row as it may have been fetched from different 
entities.  DB-Vista, MDBS, and NOMAD are a couple of CODASYL style databases 
which have (optional extra) SQL query interfaces that can return data of 
multiple value types row by each for the same column.

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Tim Streater
On 12 Jan 2020, at 22:25, Tom Browder  wrote:

> On Sun, Jan 12, 2020 at 14:05 Keith Medcalf  wrote:
>
>> On Sunday, 12 January, 2020 09:03, Tom Browder 
>> wrote:
>> >Am I missing something? I thought every column has to have a type?
>>
>> Close, but no banana.  Every value has a type.  A column may contain
>> multiple values (as in one per row)

> I assume that is just for SQLite, or am I wrong again?

In general that's correct. Lengths of VARCHARs are ignored, too, as you 
probably saw.




-- 
Cheers  --  Tim
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Simon Slavin
On 12 Jan 2020, at 10:25pm, Tom Browder  wrote:

> I assume that is just for SQLite, or am I wrong again?

You're generally right.  SQLite always uses affinities (more or less 'weak 
typing') rather than strong typing.  I don't know of any other SQL 
implementations which allow this without explicit declaration, and most don't 
allow it at all.

You can enforce strong typing in SQLite using a constraint.  But most people 
enforce type in their own code, before the value gets to SQLite.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Richard Damon

On 1/12/20 5:25 PM, Tom Browder wrote:

On Sun, Jan 12, 2020 at 14:05 Keith Medcalf  wrote:


On Sunday, 12 January, 2020 09:03, Tom Browder 
wrote:

Am I missing something? I thought every column has to have a type?

Close, but no banana.  Every value has a type.  A column may contain
multiple values (as in one per row)


Thanks, Keith.

I assume that is just for SQLite, or am I wrong again?

-Tom
That the entries for a given column in different rows can have different 
types is a peculiarity of SQLite. In a 'Standard' SQL database, a column 
has a defined type, and all rows will have values of that type (or NULL).


--
Richard Damon

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Tom Browder
On Sun, Jan 12, 2020 at 14:05 Keith Medcalf  wrote:

> On Sunday, 12 January, 2020 09:03, Tom Browder 
> wrote:
> >Am I missing something? I thought every column has to have a type?
>
> Close, but no banana.  Every value has a type.  A column may contain
> multiple values (as in one per row)


Thanks, Keith.

I assume that is just for SQLite, or am I wrong again?

-Tom
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Keith Medcalf

On Sunday, 12 January, 2020 09:03, Tom Browder  wrote:

>Am I missing something? I thought every column has to have a type?

Close, but no banana.  Every value has a type.  A column may contain multiple 
values (as in one per row).  Therefore each of those values has a type, which 
may be different from the type of the value in the same column on another row.  
Columns have an affinity, which is a preference for the type of the value to be 
stored in that columns' rows if conversion from the value provided to be stored 
to the specified affinity is possible, lossless, and reversible.

https://sqlite.org/datatype3.html

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-12 Thread Tom Browder
Am I missing something? I thought every column has to have a type?

-Tom
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-11 Thread Kevin Youren

Hi,

I checked the download mentioned in the original email. Not sure if the
table changed since the previous posts.


It seems LUTFullString has 3 BLOB rows, but LENGTH treats them as
strings.

I'm in Melbourne, Oz, so I added the UTC datetime.

regs, Kev

kevin@KCYDell:~$ cd /mnt/KCY/KCYDocs/
kevin@KCYDell:/mnt/KCY/KCYDocs$ sqlite3 /mnt/KCY/KCYDocs/200k-per-
row.sqlite
SQLite version 3.30.1 2019-10-10 20:19:45
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);
sqlite> 
sqlite> select '1',rowid,LENGTH(HEX(LUTFullString))/2 from copied
   ...> UNION ALL
   ...> select '2',rowid,INSTR(HEX(LUTFullString),'00') from copied
   ...> UNION ALL
   ...> select '3',rowid,substr(HEX(LUTFullString),0,10) from copied
   ...> UNION ALL
   ...> select '4',rowid,INSTR(SUBSTR(HEX(LUTFullString),4),'00') from
copied
   ...> UNION ALL
   ...> select '5',rowid,LENGTH(LUTFullString) from copied;
1|1|194238
1|2|183050
1|3|193908
2|1|3
2|2|3
2|3|3
3|1|2C0003007
3|2|2C0003007
3|3|2C0003007
4|1|1
4|2|1
4|3|1
5|1|1
5|2|1
5|3|1
sqlite> .quit
kevin@KCYDell:/mnt/KCY/KCYDocs$ date -u
Sat 11 Jan 23:39:43 UTC 2020
kevin@KCYDell:/mnt/KCY/KCYDocs$ 









Message: 6
Date: Fri, 10 Jan 2020 08:48:21 -0500
From: Ryan Mack 
To: sqlite-users@mailinglists.sqlite.org
Subject: [sqlite] Unexplained table bloat
Message-ID:
<
CABhGdGRbR1kT+3_BU6ob9L7tpSPZ09HJn=ofPyK6OXvgQK=_...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

Hi list,

I'm trying to understand unexplained table bloat I found in what should
be
a very small table in an old database file. If you dump/restore the
database, the problem goes away. If you duplicate the table, the
problem
propagates. Schema:

CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

Test Data:
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

In my test database file, each of these 3 rows is allocating about 40
4k
overflow pages for a total database size of about 500k. The full
database
has about 4MB of actual data which takes up over 500MB on disk. If you
want
to see/reproduce the problem you'll need my test database file which
I've
uploaded here: https://mackman.net/200k-per-row.sqlite.zip (500kb
download). I don't know why it doesn't compress better, those extra
overflow pages must be getting filled with random garbage.

My uninformed guess is there was a bug in the version of sqlite used at

database creation time that computed an incorrect overflow threshold
and is
storing each byte of the row to its own page. Since the problem goes
away
with a dump/restore, I'm considering releasing a script to do that and
mitigate the problem for affected users. Before doing that I would like
to
understand the problem better.

Thanks for reading, Ryan

PS: Here's some output from my debug session showing the 123 bytes of
data
is occupying 582k of space on disk in a freshly created table.

% sqlite3 200k-per-row.sqlite
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> .tables
copied

sqlite> .schema copied
CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

sqlite> select * from copied;
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

sqlite> select sum(length(id_local) + length(lutfullstring) +
length(luthash)) from copied;
123

sqlite> create table copied2 as select * from copied;
sqlite> SELECT name, SUM("pgsize") as sz FROM dbstat group by name;
copied|581632
copied2|581632
sqlite_master|4096



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Richard Damon

On 1/10/20 2:24 PM, Tim Streater wrote:

On 10 Jan 2020, at 18:55, Keith Medcalf  wrote:


On Friday, 10 January, 2020 11:44, Tim Streater  wrote:


On 10 Jan 2020, at 18:03, Richard Hipp  wrote:

On 1/10/20, Dominique Devienne  wrote:

There's no way at all, to know the length of a text column with
embedded NULLs?

You can find the true length of a string in bytes from C-code using
the sqlite3_column_bytes() interface. But I cannot, off-hand, think
of a way to do that from SQL.

But if I store UTF-8 in a TEXT column, surely I'm allowed to include
NULLs in that? They are after all valid UTF-8 characters.

No, they are not. The "NUL character" in Modified UTF-8 is the two-byte
sequence 0xC0 0x80. This is specifically so that 0x00 can be used as a string
terminator. Validly encoded UTF-8 encoded text stored in a C String (0x00
terminated sequence of bytes) must not contain an embedded 0x00 byte since
that byte terminates the sequence.

Nice, but Wikipedia has that as a "Derivative" and "incompatible with the UTF-8 
specification and may be rejected by conforming UTF-8 applications." It appears (though I may 
have missed it) not to be mentioned on this handy site either:

https://www.utf8-chartable.de/unicode-utf8-table.pl

I shall have to check what my preferred language's wrapper does.


It is incompatible, in the sense that it uses an encoding that the UTF-8 
specification says in invalid, and thus an application that performs 
fully all the tests on valid data forms would reject it. In many ways it 
is a compatible extension in that excluding the test that specifically 
makes the form invalid, doing the processing by the general rules of 
UTF-8, gives the expected result.


C Strings do not allow 0 bytes in them. This would normally mean that 
they do not allow the NUL character to be in a string. This extension 
allows a character which would be interpreted as the NUL character to be 
represented without needing a 0 byte.


It should be pointed out that most libraries won't be checking all the 
strings that pass through them to see if they violate the rule, as that 
is just adding a lot of overhead for very little benefit. It is really 
expected that applications will do this sort of test at the borders, 
when possibly untrusted strings come in, and know that if good strings 
come in, the following processing will keep the strings valid.


--
Richard Damon

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Tim Streater
On 10 Jan 2020, at 18:55, Keith Medcalf  wrote:

> On Friday, 10 January, 2020 11:44, Tim Streater  wrote:
>
>>On 10 Jan 2020, at 18:03, Richard Hipp  wrote:
>
>>> On 1/10/20, Dominique Devienne  wrote:
 There's no way at all, to know the length of a text column with
 embedded NULLs?
>
>>> You can find the true length of a string in bytes from C-code using
>>> the sqlite3_column_bytes() interface. But I cannot, off-hand, think
>>> of a way to do that from SQL.
>
>>But if I store UTF-8 in a TEXT column, surely I'm allowed to include
>>NULLs in that? They are after all valid UTF-8 characters.
>
> No, they are not. The "NUL character" in Modified UTF-8 is the two-byte
> sequence 0xC0 0x80. This is specifically so that 0x00 can be used as a string
> terminator. Validly encoded UTF-8 encoded text stored in a C String (0x00
> terminated sequence of bytes) must not contain an embedded 0x00 byte since
> that byte terminates the sequence.

Nice, but Wikipedia has that as a "Derivative" and "incompatible with the UTF-8 
specification and may be rejected by conforming UTF-8 applications." It appears 
(though I may have missed it) not to be mentioned on this handy site either:

https://www.utf8-chartable.de/unicode-utf8-table.pl

I shall have to check what my preferred language's wrapper does.


-- 
Cheers  --  Tim
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Richard Damon

On 1/10/20 1:43 PM, Tim Streater wrote:

On 10 Jan 2020, at 18:03, Richard Hipp  wrote:


On 1/10/20, Dominique Devienne  wrote:

There's no way at all, to know the length of a text column with embedded
NULLs?


You can find the true length of a string in bytes from C-code using
the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
of a way to do that from SQL.

But if I store UTF-8 in a TEXT column, surely I'm allowed to include NULLs in 
that? They are after all valid UTF-8 characters.



As has been said, C Strings (which is what TEXT is assumed to hold) are 
not allowed to include null characters, but are assumed to terminate at 
the first 0 bytes. There is a variant of UTF-8, called modified UTF-8 or 
MUTF-8, which allows a null character to be encoded as C0 80, which does 
decode to 0 by the base UTF-8 rules, but is disallowed by the minimum 
encoding rule, which can be used to embed nulls in strings if the system 
doesn't enforce the minimum length encoding rule (at least for this 
character).


I have no idea if that would work with SQLite though.

--
Richard Damon

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Keith Medcalf

On Friday, 10 January, 2020 11:44, Tim Streater  wrote:

>On 10 Jan 2020, at 18:03, Richard Hipp  wrote:

>> On 1/10/20, Dominique Devienne  wrote:
>>> There's no way at all, to know the length of a text column with
>>> embedded NULLs?

>> You can find the true length of a string in bytes from C-code using
>> the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
>> of a way to do that from SQL.

>But if I store UTF-8 in a TEXT column, surely I'm allowed to include
>NULLs in that? They are after all valid UTF-8 characters.

No, they are not.  The "NUL character" in Modified UTF-8 is the two-byte 
sequence 0xC0 0x80.  This is specifically so that 0x00 can be used as a string 
terminator.  Validly encoded UTF-8 encoded text stored in a C String (0x00 
terminated sequence of bytes) must not contain an embedded 0x00 byte since that 
byte terminates the sequence.

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.




___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Tim Streater
On 10 Jan 2020, at 18:03, Richard Hipp  wrote:

> On 1/10/20, Dominique Devienne  wrote:
>>
>> There's no way at all, to know the length of a text column with embedded
>> NULLs?
>>
>
> You can find the true length of a string in bytes from C-code using
> the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
> of a way to do that from SQL.

But if I store UTF-8 in a TEXT column, surely I'm allowed to include NULLs in 
that? They are after all valid UTF-8 characters.



-- 
Cheers  --  Tim
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Keith Medcalf

On Friday, 10 January, 2020 10:50, Dominique Devienne :

>On Fri, Jan 10, 2020 at 4:30 PM Richard Hipp  wrote:

>> length() on a BLOB should show the number of bytes in the BLOB.

>> length() on a string should show the number of *characters* (not
>> bytes) in the string up through but not including the first
>> zero-character.  It is possible to have additional content after the
>> first zero-character in a string, which length() will not tell you
>> about.

> There's no way at all, to know the length of a text column with embedded
> NULLs?

C-Strings cannot have embedded nulls.  The very definition of a C-String is "a 
sequence of non-zero characters followed by a zero character".  So while you 
can store and retrieve invalid C-Strings in the database (as in use BIND and 
COLUMN_TEXT), "things" (meaning software) which expects a C-String to be a 
C-String will be confused by such improper usage of a C String, and "things" 
which expect "text" fields to contain properly encoded C-Strings are likely to 
have brain-seizures.

You can, of course, cast the column as a blob (which IS allowed to have 
embedded nulls and DOES NOT have to have valid text encoding), and get the 
length of that bag-o-bytes (in bytes):

sqlite> select * from copied;
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279
sqlite> select typeof(id_local),typeof(lutfullstring),typeof(luthash) from 
copied;
integer|text|text
integer|text|text
integer|text|text
sqlite> select length(cast(lutfullstring as blob)) from copied;
194238
183050
193908

The first select is indeed selecting the ENTIRE string that has been asked for. 
 It is then using printf "%s" to print it because it is SUPPOSED TO BE a null 
terminated string.  It is doing exactly what it has been told to do.  
Similarly, the function LENGTH() on what is supposed to be a NULL terminated 
string returns the number of characters up to but not including the NULL 
terminator.

The real problem here is that a BLOB has been stored as if it were TEXT.  This 
is a failure of the application to properly sanitize its input.

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume. 



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Richard Hipp
On 1/10/20, Dominique Devienne  wrote:
>
> There's no way at all, to know the length of a text column with embedded
> NULLs?
>

You can find the true length of a string in bytes from C-code using
the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
of a way to do that from SQL.
-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Dominique Devienne
On Fri, Jan 10, 2020 at 4:30 PM Richard Hipp  wrote:
> length() on a BLOB should show the number of bytes in the BLOB.
>
> length() on a string should show the number of *characters* (not
> bytes) in the string up through but not including the first
> zero-character.  It is possible to have additional content after the
> first zero-character in a string, which length() will not tell you about.

Hi Richard,

There's no way at all, to know the length of a text column with embedded NULLs?

--DD
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Simon Slavin
On 10 Jan 2020, at 3:11pm, Ryan Mack  wrote:

> OK, I think I've got a better sense now. Hex encoding the column shows that 
> there's actually a huge amount of data stored in there. For some reason 
> length() isn't revealing it even if the column type is blob. Dumping and 
> restoring the table is truncating the data.

Is there a chance that some part of the software thinks that 0x00 is a 
terminator for BLOBs, and some other part ignores the 0x00 and abides strictly 
by the length ?  I've seen that before.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Richard Hipp
On 1/10/20, Ryan Mack  wrote:
> OK, I think I've got a better sense now. Hex encoding the column shows
> that there's actually a huge amount of data stored in there. For some
> reason length() isn't revealing it even if the column type is blob.
> Dumping and restoring the table is truncating the data.

length() on a BLOB should show the number of bytes in the BLOB.

length() on a string should show the number of *characters* (not
bytes) in the string up through but not including the first
zero-character.  It is possible to have additional content after the
first zero-character in a string, which length() will not tell you
about.

-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Ryan Mack
OK, I think I've got a better sense now. Hex encoding the column shows
that there's actually a huge amount of data stored in there. For some
reason length() isn't revealing it even if the column type is blob.
Dumping and restoring the table is truncating the data.

On Fri, Jan 10, 2020 at 9:58 AM Ryan Mack  wrote:
>
> Accidentally sent my first reply direct, responding to the list. I'm
> now wondering if there's a lot of binary data hidden in each row.
> Trying to figure out how to determine that if length() doesn't show
> anything.
>
>
> Prior response:
>
> An excellent idea, thank you :-) .  The output is included below.
>
> I am continuing to do my own debugging in parallel. I am now leaning
> towards a new hypothesis that there may be a bug handling
> strange/malformed unicode that is resulting in a large amount of
> garbage data being stored into the record. I'm trying to figure out
> the exact conditions of copying/loading data into the table that
> causes the issue to propagate or disappear.
>
>
>  % ./sqlite3_analyzer ~/Downloads/200k-per-row.sqlite
> /** Disk-Space Utilization Report For 
> /Users/nadia/Downloads/200k-per-row.sqlite
>
> Page size in bytes 4096
> Pages in the whole file (measured) 143
> Pages in the whole file (calculated).. 143
> Pages that store data. 143100.0%
> Pages on the freelist (per header) 00.0%
> Pages on the freelist (calculated) 00.0%
> Pages of auto-vacuum overhead. 00.0%
> Number of tables in the database.. 2
> Number of indices. 0
> Number of defined indices. 0
> Number of implied indices. 0
> Size of the file in bytes. 585728
> Bytes of user payload stored.. 571322  97.5%
>
> *** Page counts for all tables with their indices 
> *
>
> COPIED 142 99.30%
> SQLITE_MASTER. 10.70%
>
> *** Page counts for all tables and indices separately 
> *
>
> COPIED 142 99.30%
> SQLITE_MASTER. 10.70%
>
> *** All tables 
> 
>
> Percentage of total database.. 100.0%
> Number of entries. 4
> Bytes of storage consumed. 585728
> Bytes of payload.. 571412  97.6%
> Bytes of metadata. 1284 0.22%
> Average payload per entry. 142853.00
> Average unused bytes per entry 3393.00
> Average metadata per entry 321.00
> Average fanout 3.00
> Maximum payload per entry. 194280
> Entries that use overflow. 3   75.0%
> Index pages used.. 1
> Primary pages used 4
> Overflow pages used... 138
> Total pages used.. 143
> Unused bytes on index pages... 407099.37%
> Unused bytes on primary pages. 950258.0%
> Unused bytes on overflow pages 00.0%
> Unused bytes on all pages. 135722.3%
>
> *** Table COPIED 
> **
>
> Percentage of total database..  99.30%
> Number of entries. 3
> Bytes of storage consumed. 581632
> Bytes of payload.. 571322  98.2%
> Bytes of metadata. 1172 0.20%
> B-tree depth.. 2
> Average payload per entry. 190440.67
> Average unused bytes per entry 3226.00
> Average metadata per entry 390.67
> Average fanout 3.00
> Non-sequential pages.. 00.0%
> Maximum payload per entry. 194280
> Entries that use overflow. 3  100.0%
> Index pages used.. 1
> Primary pages used 3
> Overflow pages used... 138
> Total pages used.. 142
> Unused bytes on index pages... 407099.37%
> Unused bytes on primary pages. 560845.6%

Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Ryan Mack
Accidentally sent my first reply direct, responding to the list. I'm
now wondering if there's a lot of binary data hidden in each row.
Trying to figure out how to determine that if length() doesn't show
anything.


Prior response:

An excellent idea, thank you :-) .  The output is included below.

I am continuing to do my own debugging in parallel. I am now leaning
towards a new hypothesis that there may be a bug handling
strange/malformed unicode that is resulting in a large amount of
garbage data being stored into the record. I'm trying to figure out
the exact conditions of copying/loading data into the table that
causes the issue to propagate or disappear.


 % ./sqlite3_analyzer ~/Downloads/200k-per-row.sqlite
/** Disk-Space Utilization Report For /Users/nadia/Downloads/200k-per-row.sqlite

Page size in bytes 4096
Pages in the whole file (measured) 143
Pages in the whole file (calculated).. 143
Pages that store data. 143100.0%
Pages on the freelist (per header) 00.0%
Pages on the freelist (calculated) 00.0%
Pages of auto-vacuum overhead. 00.0%
Number of tables in the database.. 2
Number of indices. 0
Number of defined indices. 0
Number of implied indices. 0
Size of the file in bytes. 585728
Bytes of user payload stored.. 571322  97.5%

*** Page counts for all tables with their indices *

COPIED 142 99.30%
SQLITE_MASTER. 10.70%

*** Page counts for all tables and indices separately *

COPIED 142 99.30%
SQLITE_MASTER. 10.70%

*** All tables 

Percentage of total database.. 100.0%
Number of entries. 4
Bytes of storage consumed. 585728
Bytes of payload.. 571412  97.6%
Bytes of metadata. 1284 0.22%
Average payload per entry. 142853.00
Average unused bytes per entry 3393.00
Average metadata per entry 321.00
Average fanout 3.00
Maximum payload per entry. 194280
Entries that use overflow. 3   75.0%
Index pages used.. 1
Primary pages used 4
Overflow pages used... 138
Total pages used.. 143
Unused bytes on index pages... 407099.37%
Unused bytes on primary pages. 950258.0%
Unused bytes on overflow pages 00.0%
Unused bytes on all pages. 135722.3%

*** Table COPIED **

Percentage of total database..  99.30%
Number of entries. 3
Bytes of storage consumed. 581632
Bytes of payload.. 571322  98.2%
Bytes of metadata. 1172 0.20%
B-tree depth.. 2
Average payload per entry. 190440.67
Average unused bytes per entry 3226.00
Average metadata per entry 390.67
Average fanout 3.00
Non-sequential pages.. 00.0%
Maximum payload per entry. 194280
Entries that use overflow. 3  100.0%
Index pages used.. 1
Primary pages used 3
Overflow pages used... 138
Total pages used.. 142
Unused bytes on index pages... 407099.37%
Unused bytes on primary pages. 560845.6%
Unused bytes on overflow pages 00.0%
Unused bytes on all pages. 9678 1.7%

*** Table SQLITE_MASTER ***

Percentage of total database..   0.70%
Number of entries. 1
Bytes of storage consumed. 4096
Bytes of payload.. 90   2.2%
Bytes of 

Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Richard Hipp
On 1/10/20, Ryan Mack  wrote:
>
> I'm trying to understand unexplained table bloat

The sqlite3_analyzer command-line utility program (available  in the
"Precompiled binaries" bundles on the https://sqlite.org/download.html
page) is designed to help understand these kinds of problems.  Please
run that utility on the database and perhaps post the output here.
-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Ryan Mack
Thank you, I was unaware of the integrity_check pragma. It returns OK
for the database in question.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread Simon Slavin

On 10 Jan 2020, at 2:06pm, David Raymond  wrote:

> Well something's weird anyway. When I open it with the command line tool it 
> queries it just fine

Did you run an integrity_check on the database ?  It looks from your posts as 
if it's corrupt.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unexplained table bloat

2020-01-10 Thread David Raymond
Well something's weird anyway. When I open it with the command line tool it 
queries it just fine. I tried to open it in Python to check all the characters 
in the strings and got this:

Traceback (most recent call last):
  File "...\Testing4.py", line 8, in 
cur.execute("select * from copied;")
sqlite3.OperationalError: Could not decode to UTF-8 column 'LUTFullString' with 
text ','


Which I suppose it just as likely to be my own problem though.


-Original Message-
From: sqlite-users  On Behalf Of 
Ryan Mack
Sent: Friday, January 10, 2020 8:48 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: [sqlite] Unexplained table bloat

Hi list,

I'm trying to understand unexplained table bloat I found in what should be
a very small table in an old database file. If you dump/restore the
database, the problem goes away. If you duplicate the table, the problem
propagates. Schema:

CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

Test Data:
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

In my test database file, each of these 3 rows is allocating about 40 4k
overflow pages for a total database size of about 500k. The full database
has about 4MB of actual data which takes up over 500MB on disk. If you want
to see/reproduce the problem you'll need my test database file which I've
uploaded here: https://mackman.net/200k-per-row.sqlite.zip (500kb
download). I don't know why it doesn't compress better, those extra
overflow pages must be getting filled with random garbage.

My uninformed guess is there was a bug in the version of sqlite used at
database creation time that computed an incorrect overflow threshold and is
storing each byte of the row to its own page. Since the problem goes away
with a dump/restore, I'm considering releasing a script to do that and
mitigate the problem for affected users. Before doing that I would like to
understand the problem better.

Thanks for reading, Ryan

PS: Here's some output from my debug session showing the 123 bytes of data
is occupying 582k of space on disk in a freshly created table.

% sqlite3 200k-per-row.sqlite
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> .tables
copied

sqlite> .schema copied
CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

sqlite> select * from copied;
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

sqlite> select sum(length(id_local) + length(lutfullstring) +
length(luthash)) from copied;
123

sqlite> create table copied2 as select * from copied;
sqlite> SELECT name, SUM("pgsize") as sz FROM dbstat group by name;
copied|581632
copied2|581632
sqlite_master|4096
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Unexplained table bloat

2020-01-10 Thread Ryan Mack
Hi list,

I'm trying to understand unexplained table bloat I found in what should be
a very small table in an old database file. If you dump/restore the
database, the problem goes away. If you duplicate the table, the problem
propagates. Schema:

CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

Test Data:
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

In my test database file, each of these 3 rows is allocating about 40 4k
overflow pages for a total database size of about 500k. The full database
has about 4MB of actual data which takes up over 500MB on disk. If you want
to see/reproduce the problem you'll need my test database file which I've
uploaded here: https://mackman.net/200k-per-row.sqlite.zip (500kb
download). I don't know why it doesn't compress better, those extra
overflow pages must be getting filled with random garbage.

My uninformed guess is there was a bug in the version of sqlite used at
database creation time that computed an incorrect overflow threshold and is
storing each byte of the row to its own page. Since the problem goes away
with a dump/restore, I'm considering releasing a script to do that and
mitigate the problem for affected users. Before doing that I would like to
understand the problem better.

Thanks for reading, Ryan

PS: Here's some output from my debug session showing the 123 bytes of data
is occupying 582k of space on disk in a freshly created table.

% sqlite3 200k-per-row.sqlite
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> .tables
copied

sqlite> .schema copied
CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

sqlite> select * from copied;
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

sqlite> select sum(length(id_local) + length(lutfullstring) +
length(luthash)) from copied;
123

sqlite> create table copied2 as select * from copied;
sqlite> SELECT name, SUM("pgsize") as sz FROM dbstat group by name;
copied|581632
copied2|581632
sqlite_master|4096
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users