Re: [sqlite] Unexplained table bloat
On Sun, 12 Jan 2020 16:24:53 -0700 "Keith Medcalf" wrote: > while one may be tempted to claim that "consistency is the hobgoblin > of little minds" You might have forgotten that the much overpraised Ralph Waldo specified "a foolish consistency". He only meant: don't try to hang your hat on too small a peg. > Doing this does not really do much since you still have to check the > type on retrieval of the value anyway in order to know what to do > with it. That depends what "much" is. The value of constraints used to enforce types is to reject from the database values outside the domain. That not only simplifies application logic, but also the logical consistency of the queries themselves. If "year" is always an integer -- never NULL, never a string -- then avg(year) and count(year) are always correct. But if the database contains for "year" a string like "it was a good one", or NULLs, they're both unreliable. --jkl ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Sun, 12 Jan 2020 15:48:05 -0700 "Keith Medcalf" wrote: > SQL is simply a Structured Query Language that can be overlaid on a > variety of datastore models, one of which is the Relational model. I think that's misleading, Keith, for a couple of reasons. SQL was invented by IBM expressly and specfically for its "relational" product. Its wordiness is the product of its roots in "4th generation" languages in vogue at the time, the promise of which was to permit users to "write their own reports". Hence the strict select-from-where syntax, meant to be so simple that untrained users could figure it out. While its true, as you say, that many pre-relational systems (and post-relational ones, later, feh) added SQL on their shingles, those were never coherent implementations. They were a way to say Yes! when the question was, "Do you support SQL?" > Many (most in fact) datastores require that all instances of the same > "column" in an "entity" be the same type To the extent that SQL implements relational algebra/calculus, its utility and consistency *requires* that each column be of a defined type. For example, if the query includes, where A between 0 and 1 and A is a *not* a numeric value, then the query is reduced from rigorous first-order predicate logic to gibberish. SQLite serves a particular niche very well. The choice not to enforce type constraints for declared datatypes biases the system toward ease-of-insertion. That has advantages in the SQLite problem domain, but the trade-off comes at a price: it makes SQLite *harder* to use in a rigorous way. --jkl ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Monday, 13 January, 2020 06:36, Dominique Devienne wrote: > Please remind me, is the encoding a "client-side" setting, or also a > "server-side" (i.e. stored) setting? "pragma encoding " sets the internal storage format for text encoding in a database (on the "inside SQLite" side of the API demarcation). It is persistent (a database attribute) and cannot be changed once the database has been created. The boundary between "internal" and "external" is the SQLite3 API. Thus you send "external" text into SQLite using the sqlite3_bind_text* sqlite3_result_text* interfaces. The call specifies the format you are using for the "external" representation. When the text gets "internal" (to the other side of the API) it is stored (and converted if necessary to) the "pragma encoding" setting of the database. You retrieve the "internal" values into your external program using sqlite3_column_text* and sqlite3_value_text* interfaces, which converts (if necessary) the data from the "internal" encoding to the external encoding you requested. https://sqlite.org/pragma.html#pragma_encoding The length() function is an external function (as in on the "exernal" side of the demarcation). It retrieves its arguments using the "sqlite3_value" interfaces from the internal storage format into the requested external storage format, does its thing, and then returns an external result back into the "internal" part of SQLite using the sqlite3_result* interface. Contrast this with a function such as cast(...) which is strictly internal. It operates entirely internally to SQLite directly on the internal data and does not retrieve the "internal data" into an external format then send the converted external data back into SQLite. -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
Well, I believe this is the relevant bit from the docs for binding: https://www.sqlite.org/c3ref/bind_blob.html "If a non-negative fourth parameter is provided to sqlite3_bind_text() or sqlite3_bind_text16() or sqlite3_bind_text64() then that parameter must be the byte offset where the NUL terminator would occur assuming the string were NUL terminated. If any NUL characters occur at byte offsets less than the value of the fourth parameter then the resulting string value will contain embedded NULs. The result of expressions involving strings with embedded NULs is undefined. " -Original Message- From: sqlite-users On Behalf Of Barry Smith Sent: Monday, January 13, 2020 1:54 PM To: SQLite mailing list Subject: Re: [sqlite] Unexplained table bloat On the original topic... How does one end up with a database in this state? I.e with a binary value that contains 0x00 bytes followed by other bytes but a type of TEXT? If the definition of a text string in SQLite is that it ends at the first 0x00 byte, then it seems that anything stored as a text string should adhere to that. So a database with a TEXT value that contains characters after the first 0x00 should be considered corrupt. Given that to retrieve the actual contents of the cell it must be cast to BLOB, why not force the storage of any string that contains 0x00 as a BLOB in the first place? What am I missing here? On 13 Jan 2020, at 6:02 am, Simon Slavin wrote: > > On 13 Jan 2020, at 9:26am, Dominique Devienne wrote: > >> Which implies length(text_val) is O(N), while >> length(blob_val) is O(1), >> something I never quite realized. > > For this reason, and others discussed downthread, some languages which store > Unicode strings store the number of graphemes as well as its contents. So > functions which care about the … let's call it "width" … just retrieve that > number rather than having to parse the string to figure out the length. > > In a Unicode string 'length' can mean > > 1) octet count (number of 8-bit bytes used to store the string) > 2) number of code points (basic unicode unit) > 3) number of code units (how code points get arranged in UTF8, UTF16, etc., > not as simple as it looks) > 4) length in graphemes (space-using units) > 5) length in glyphs (font-rendering units) > > and probably others I've forgotten. Not to mention that I simplified the > definitions of the above and may have got them wrong. > > An application centred around rendering text (e.g. vector graphics drawing > apps) might have each piece of text stored with all five of those numbers, > just to save it from having to constantly recalculate them. > ___ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On the original topic... How does one end up with a database in this state? I.e with a binary value that contains 0x00 bytes followed by other bytes but a type of TEXT? If the definition of a text string in SQLite is that it ends at the first 0x00 byte, then it seems that anything stored as a text string should adhere to that. So a database with a TEXT value that contains characters after the first 0x00 should be considered corrupt. Given that to retrieve the actual contents of the cell it must be cast to BLOB, why not force the storage of any string that contains 0x00 as a BLOB in the first place? What am I missing here? On 13 Jan 2020, at 6:02 am, Simon Slavin wrote: > > On 13 Jan 2020, at 9:26am, Dominique Devienne wrote: > >> Which implies length(text_val) is O(N), while >> length(blob_val) is O(1), >> something I never quite realized. > > For this reason, and others discussed downthread, some languages which store > Unicode strings store the number of graphemes as well as its contents. So > functions which care about the … let's call it "width" … just retrieve that > number rather than having to parse the string to figure out the length. > > In a Unicode string 'length' can mean > > 1) octet count (number of 8-bit bytes used to store the string) > 2) number of code points (basic unicode unit) > 3) number of code units (how code points get arranged in UTF8, UTF16, etc., > not as simple as it looks) > 4) length in graphemes (space-using units) > 5) length in glyphs (font-rendering units) > > and probably others I've forgotten. Not to mention that I simplified the > definitions of the above and may have got them wrong. > > An application centred around rendering text (e.g. vector graphics drawing > apps) might have each piece of text stored with all five of those numbers, > just to save it from having to constantly recalculate them. > ___ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 13 Jan 2020, at 9:26am, Dominique Devienne wrote: > Which implies length(text_val) is O(N), while > length(blob_val) is O(1), > something I never quite realized. For this reason, and others discussed downthread, some languages which store Unicode strings store the number of graphemes as well as its contents. So functions which care about the … let's call it "width" … just retrieve that number rather than having to parse the string to figure out the length. In a Unicode string 'length' can mean 1) octet count (number of 8-bit bytes used to store the string) 2) number of code points (basic unicode unit) 3) number of code units (how code points get arranged in UTF8, UTF16, etc., not as simple as it looks) 4) length in graphemes (space-using units) 5) length in glyphs (font-rendering units) and probably others I've forgotten. Not to mention that I simplified the definitions of the above and may have got them wrong. An application centred around rendering text (e.g. vector graphics drawing apps) might have each piece of text stored with all five of those numbers, just to save it from having to constantly recalculate them. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Mon, Jan 13, 2020 at 1:10 PM Keith Medcalf wrote: > If the register object contains "text" and you cast it to a blob (remove the > text affinity) you are left with just the bag-o-bytes, and length() will > return the size of the bag encoded in the register. If the data in the > register is other than type "text" then it must be converted to text first > (in the database encoding) and then the cast will remove the text affinity, > after which the value returned by the length() function will be the number of > bytes in the bag that holds that text representation: > > sqlite> pragma encoding='utf-16'; > sqlite> create table x(x); > sqlite> insert into x values ('text' || char(0) || 'text'); > sqlite> select x, typeof(x), length(x), length(cast(x as blob)) from x; > text|text|4|18 Please remind me, is the encoding a "client-side" setting, or also a "server-side" (i.e. stored) setting? I wasn't sure whether pragma encoding='utf-16' affected the stored state as well, or whether it was always in UTF-8 and SQLite was doing conversion on the fly for the client requested encoding. I thought of lengthof() as the size stored in the value header itself, which I assumed was always in bytes. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 1/13/20 5:24 AM, Dominique Devienne wrote: On Mon, Jan 13, 2020 at 11:07 AM Keith Medcalf wrote: On Monday, 13 January, 2020 02:27, Dominique Devienne wrote: I'd vote for a lengthof(col) that's always O(1) for both text and blob So what should lengthof(something) return the number of bytes in the 'database encoding' or something else? Bytes of course. Of the data stored, i.e. excluding the header byte and encoded size (if any) from the file-format. Basically the same as length() *except* for text values, resulting in O(1) behavior. --DD PS: I keep forgetting length(text_val) returns the number of code-points in fact :) PPS: Surrogate pairs count as one or two code points? That's just bait, I don't really want to know :))) Re the PPS, UTF-8 isn't allowed to have Surrogate Pairs. Non-BMP characters which would use Surrogate Pairs in UTF-16 are supposed to be converted to their fundamental 21 bit value and that encoded into UTF-8. If the code doesn't validate the data well enough to catch that issue, then I suspect the character counting would count each half of the surrogate pairs as a code-point, -- Richard Damon ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
If the register object contains "text" and you cast it to a blob (remove the text affinity) you are left with just the bag-o-bytes, and length() will return the size of the bag encoded in the register. If the data in the register is other than type "text" then it must be converted to text first (in the database encoding) and then the cast will remove the text affinity, after which the value returned by the length() function will be the number of bytes in the bag that holds that text representation: SQLite version 3.31.0 2020-01-12 23:30:01 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. sqlite> pragma encoding; UTF-8 sqlite> create table x(x); sqlite> insert into x values ('text' || char(0) || 'text'); sqlite> insert into x values (3.14159); sqlite> select x, typeof(x), length(x), length(cast(x as blob)) from x; text|text|4|9 3.14159|real|7|7 sqlite> .q SQLite version 3.31.0 2020-01-12 23:30:01 Enter ".help" for usage hints. Connected to a transient in-memory database. Use ".open FILENAME" to reopen on a persistent database. sqlite> pragma encoding='utf-16'; sqlite> pragma encoding; UTF-16le sqlite> create table x(x); sqlite> insert into x values ('text' || char(0) || 'text'); sqlite> insert into x values (3.14159); sqlite> select x, typeof(x), length(x), length(cast(x as blob)) from x; text|text|4|18 3.14159|real|7|14 sqlite> .q -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 2020/01/13 12:24 PM, Dominique Devienne wrote: Bytes of course. Of the data stored, i.e. excluding the header byte I checked, I was apparently correct about the casting. This following extract from a DB I got from a forum member with Greek-to-Danish translations where I added length checks: -- SQLite version 3.30.1 [ Release: 2019-10-10 ] on SQLitespeed version 2.1.3.11. -- SELECT Glose_ID, Græsk, Dansk, length(Dansk) AS len, length(CAST(Dansk AS BLOB)) AS bytelen FROM Gloser WHERE 1 LIMIT 20; -- Glose_ID | Græsk | Dansk | len | bytelen -- | -- | | --- | --- -- 1 | απόδειξη η | 1. bevis 2. kvittering | 23 | 23 -- 2 | δεικτική αντωνυμία | påpegende stedord, demonstrativ pronomen | 40 | 41 -- 3 | δεικτικός-ή-ό | påpegende | 9 | 10 -- 4 | καθόλου | slet (ikke), overhovedet (ikke) | 31 | 31 -- 5 | κώδικας ο | 1. kode 2. lov | 14 | 14 -- 6 | πλην | (mat) minus; (præp + G) undtagen | 32 | 33 -- 7 | προσδιορίζω | præcisere, fastsætte, bestemme (nøjagtigt) | 42 | 45 -- 8 | προσδιορισμός ο | fastsættelse, (nøjagtig) bestemmelse | 36 | 38 -- 9 | προσδιορισής ο | ?? (- RH) foremntlig samme betydning som προσδιορισμός | 54 | 67 -- 10 | φακός ο | 1. linse 2. glas, brilleglas 3. lygte, lommelygte | 51 | 51 -- 11 | βρώμικος-η-ο | snavset, beskidt; (fig) lyssky, skummel | 39 | 39 -- 12 | δε μου λες... | sig mig engang… (egl “du fortæller mig ikke”) | 45 | 52 -- 13 | δημοσιογράφος ο/η | journalist | 10 | 10 -- 14 | κέρμα το | mønt, småmønt | 13 | 16 -- 15 | κοπέλα η | pige | 4 | 4 -- 16 | μαθήτρια η | elev, skoleelev (kvindelig) | 27 | 27 -- 17 | μαθητής ο | elev, skoleelev (mandlig) | 25 | 25 -- 18 | μπουκάλι το | flaske | 6 | 6 -- 20 | ποτήρι το | glas (Både glas og flaske er intetkøn som vin. Øl er hunkøn) | 60 | 64 -- 21 | ίσιος-α-ο | lige | 4 | 4 Hope that format is not too messed up by my e-mailer. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 2020/01/13 12:24 PM, Dominique Devienne wrote: On Mon, Jan 13, 2020 at 11:07 AM Keith Medcalf wrote: On Monday, 13 January, 2020 02:27, Dominique Devienne wrote: I'd vote for a lengthof(col) that's always O(1) for both text and blob So what should lengthof(something) return the number of bytes in the 'database encoding' or something else? Bytes of course. Of the data stored, i.e. excluding the header byte and encoded size (if any) from the file-format. Basically the same as length() *except* for text values, resulting in O(1) behavior. --DD Please have mercy - my beautiful text-extracty queries will no longer work! I'm OK with a second other function, perhaps called bytelength() that returns what you suggest, but length is mostly used in any text extracting or parsing along with with other functions like instr() and substr() etc. which all take character indexes based on positions of code-points and not of actual bytes. Back in the day when all text was ANSI ASCII 8-byte Latin-English characters this was easy, but now it ain't. That said, I thought that if you cast a string to BLOB and then query the length, like: SELECT length(CAST(firstname AS BLOB)) AS bytelen; You essentially get what you asked for, but I've never used it, so am not sure. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Mon, Jan 13, 2020 at 11:07 AM Keith Medcalf wrote: > On Monday, 13 January, 2020 02:27, Dominique Devienne > wrote: > >> I'd vote for a lengthof(col) that's always O(1) for both text and blob > > So what should lengthof(something) return the number of bytes in the > 'database encoding' or something else? Bytes of course. Of the data stored, i.e. excluding the header byte and encoded size (if any) from the file-format. Basically the same as length() *except* for text values, resulting in O(1) behavior. --DD PS: I keep forgetting length(text_val) returns the number of code-points in fact :) PPS: Surrogate pairs count as one or two code points? That's just bait, I don't really want to know :))) ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Monday, 13 January, 2020 02:27, Dominique Devienne wrote: >On Fri, Jan 10, 2020 at 7:03 PM Richard Hipp wrote: >> On 1/10/20, Dominique Devienne wrote: >>> There's no way at all, to know the length of a text column with >>> embedded NULLs? >> You can find the true length of a string in bytes from C-code using >> the sqlite3_column_bytes() interface. But I cannot, off-hand, think >> of a way to do that from SQL. > That's what I thought. Which implies length(text_val) is O(N), while > length(blob_val) is O(1), something I never quite realized. And this > despite both storing the length the same way at the file-format level. > That's kind of a gotcha, which might be worth documenting. No, they are measured in different units. length(something) where something is NOT a text value returns the length in BYTES of that something. For a BLOB this is the number of bytes according to the length data stored in the database. For NULL, INTEGER, or REAL values, it is the number of BYTES in the ASCII text representation of that thing (which also happens to be the number of characters since each ASCII character takes one byte). length(something) where something IS a text value returns the number of codepoints in the UTF-8 representation of that text and counting stops at the zero terminator (since the definition of "text" is a C string, all text values have an appended 0 terminator -- if there is an embedded 0 character, that terminates the counting). >> I'd vote for a lengthof(col) that's always O(1) for both text and blob >> values, although I'm not holding my breath for it, as I suspect it's >> unlikely to be added, given its relative low value. So what should lengthof(something) return the number of bytes in the 'database encoding' or something else? -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Fri, Jan 10, 2020 at 7:03 PM Richard Hipp wrote: > On 1/10/20, Dominique Devienne wrote: > > There's no way at all, to know the length of a text column with embedded > > NULLs? > > You can find the true length of a string in bytes from C-code using > the sqlite3_column_bytes() interface. But I cannot, off-hand, think > of a way to do that from SQL. That's what I thought. Which implies length(text_val) is O(N), while length(blob_val) is O(1), something I never quite realized. And this despite both storing the length the same way at the file-format level. That's kind of a gotcha, which might be worth documenting. I'd vote for a lengthof(col) that's always O(1) for both text and blob values, although I'm not holding my breath for it, as I suspect it's unlikely to be added, given its relative low value. I don't disagree with Keith that text shouldn't contain embedded NULLs, and that C-String are by design and definition that way, I was more thinking of the C++ use case of having an std::string with embedded NULLs, which is perfectly OK and common enough, and using a C++ wrapper for SQLite (which typically uses overloading for binding for example), which will insert a text value for that case, using .c_str() + .length() (or .data() + .size(), doesn't matter, ends up the same), leading to the very issue that started this thread (just a guess). The inability to correctly size a value in SQL (and thus a column, via a sum() group by) for text with embedded nulls is quite unfortunate. And the fact length(text_col) is also O(N) is similarly unfortunate. Thus the above idea of an O(1) lengthof(col), as a companion to typeof(col). Thanks, --DD ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 2020/01/13 12:25 AM, Tom Browder wrote: On Sun, Jan 12, 2020 at 14:05 Keith Medcalf wrote: Close, but no banana. Every value has a type. A column may contain multiple values (as in one per row) Thanks, Keith. I assume that is just for SQLite, or am I wrong again? You are not wrong. This comes up from time to time and is always quite interesting. As Keith explained, the Relational Model can be applied on most kinds of data-stores. SQLite's data store happens to allow per-value typing, and then only using "Duck typing" (https://en.wikipedia.org/wiki/Duck_typing). While on the subject of SQLite peculiarities, there's a few to note if you are new to SQLite, like: - You don't need any type for a column, but untyped columns are treated as having blob affinity, not text. - A column declared as VARCHAR(30), or TEXT(30) or INT(11) for that matter, will be fully accepted, but there is no actual length constraint. You can put any length value in the column. - NULL values are distinct from each other (in some cases) so that if A is NULL and B is NULL, then A = B returns FALSE (0) and A <> B also returns FALSE!. This is useful, but note that a Primary Key in SQLite without also having the NOT NULL constraint, will allow duplicate NULL values in the key. (the row-id alias being the exception). - While on the point, the special type "INTEGER PRIMARY KEY" is an Alias for the internal index (row-id) of the table, and so doesn't allow NULLs. - While on it still... Tables don't always have row-ids, and there is no easy check to know if it does, so unless you made the DB yourself, you can't safely just query the row-id. - Since typing is per value, it is not an error in SQLite for Parent and Child Key columns in a Foreign-Key relationship to have different types - this can have some unexpected results! - A datetime is a Numeric type expressed as text (ISO8601) and doesn't inherently know anything about time or time-zones. (The date-handling functions work amazingly well though). - A column with the type STRING will have Integer affinity. - A spelling error in your schema, such as CREATE TABLE t(id INT, name T3XT); will not be an error - SQLite will silently regard that column affinity as blob. - Quotes are more or less ambivalent... CREATE TABLE t("id" int, [name] text, `age` [int]) is a perfectly valid schema statement, and Set a = "John" can mean different things based on whether there is a column named John or not. I'm probably not remembering all of them now, but we've spent lots of time musing about it before (forum searches will probably provide a host of discussions, including much lobbying for a "strict" mode) and because of all that, there's a fun feature added in SQLitespeed that does Schema-checking and prints a list of warnings if it contains one or more of these SQLite quirks (including misspelled types, which helps me a lot), and on the SQLite site there's also a section on some of these peculiarities (https://www.sqlite.org/quirks.html). I'm hoping someone else will add the quirks which I forgot about :) We all came to love (mostly) and often use these quirks to some advantage, but it pays to be aware of them, especially coming from another DB architecture where typing and the like are more rigid. Cheers! Ryan ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Sunday, 12 January, 2020 15:31, Simon Slavin wrote: >You're generally right. SQLite always uses affinities (more or less >'weak typing') rather than strong typing. I don't know of any other SQL >implementations which allow this without explicit declaration, and most >don't allow it at all. SQLite is strongly typed however "entities" (table rows) are composed of a fixed number of dynamically typed values. A collection of entities (a table) has an affinity (preference for a particular type if possible) for each "column" in the collection. This is not "weak typing" nor is it "duck typing". It is "dynamic typing". >You can enforce strong typing in SQLite using a constraint. But most >people enforce type in their own code, before the value gets to SQLite. Well, no. You can restrict the size of the dynamic though a check constraint, but you cannot make the members of entities strongly typed. For example: create table x ( c text check (typeof(c) is 'text') ); does not strongly type the column "c" of entity collection "x". What it does is ensure that only text values are stored in that particular dynamically typed column of the entity. You will note that the declaration is inconsistent -- the two available correct forms would be: create table x ( c text not null check (typeof(c) is 'text') ); and create table x ( c text check (typeof(c) in ('null', 'text')) ); while one may be tempted to claim that "consistency is the hobgoblin of little minds" its converse, inconsistency, is often a source of errors. Some affinities are more complicated. For example: create table x ( c numeric check (typeof(c) in ('null', 'real', 'integer')) ); both 'real' and 'integer' are required for numeric affinity enforcement, and the 'null' because the column may contain nulls. Doing this does not really do much since you still have to check the type on retrieval of the value anyway in order to know what to do with it. -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Sunday, 12 January, 2020 15:29, Richard Damon wrote: >On 1/12/20 5:25 PM, Tom Browder wrote: >> On Sun, Jan 12, 2020 at 14:05 Keith Medcalf wrote: >>> On Sunday, 12 January, 2020 09:03, Tom Browder >>> wrote: Am I missing something? I thought every column has to have a type? >>> Close, but no banana. Every value has a type. A column may contain >>> multiple values (as in one per row) >> I assume that is just for SQLite, or am I wrong again? > That the entries for a given column in different rows can have different > types is a peculiarity of SQLite. In a 'Standard' SQL database, a column > has a defined type, and all rows will have values of that type (or NULL). It is a peculiarity of the underlying datastore used by SQLite. Many (most in fact) datastores require that all instances of the same "column" in an "entity" be the same type -- some do not (SQLite is one of them). SQL is simply a Structured Query Language that can be overlaid on a variety of datastore models, one of which is the Relational model. When SQL is used on other database organizations it is entirely possible for the "type" of a particular returned "column" to vary by row as it may have been fetched from different entities. DB-Vista, MDBS, and NOMAD are a couple of CODASYL style databases which have (optional extra) SQL query interfaces that can return data of multiple value types row by each for the same column. -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 12 Jan 2020, at 22:25, Tom Browder wrote: > On Sun, Jan 12, 2020 at 14:05 Keith Medcalf wrote: > >> On Sunday, 12 January, 2020 09:03, Tom Browder >> wrote: >> >Am I missing something? I thought every column has to have a type? >> >> Close, but no banana. Every value has a type. A column may contain >> multiple values (as in one per row) > I assume that is just for SQLite, or am I wrong again? In general that's correct. Lengths of VARCHARs are ignored, too, as you probably saw. -- Cheers -- Tim ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 12 Jan 2020, at 10:25pm, Tom Browder wrote: > I assume that is just for SQLite, or am I wrong again? You're generally right. SQLite always uses affinities (more or less 'weak typing') rather than strong typing. I don't know of any other SQL implementations which allow this without explicit declaration, and most don't allow it at all. You can enforce strong typing in SQLite using a constraint. But most people enforce type in their own code, before the value gets to SQLite. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 1/12/20 5:25 PM, Tom Browder wrote: On Sun, Jan 12, 2020 at 14:05 Keith Medcalf wrote: On Sunday, 12 January, 2020 09:03, Tom Browder wrote: Am I missing something? I thought every column has to have a type? Close, but no banana. Every value has a type. A column may contain multiple values (as in one per row) Thanks, Keith. I assume that is just for SQLite, or am I wrong again? -Tom That the entries for a given column in different rows can have different types is a peculiarity of SQLite. In a 'Standard' SQL database, a column has a defined type, and all rows will have values of that type (or NULL). -- Richard Damon ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Sun, Jan 12, 2020 at 14:05 Keith Medcalf wrote: > On Sunday, 12 January, 2020 09:03, Tom Browder > wrote: > >Am I missing something? I thought every column has to have a type? > > Close, but no banana. Every value has a type. A column may contain > multiple values (as in one per row) Thanks, Keith. I assume that is just for SQLite, or am I wrong again? -Tom ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Sunday, 12 January, 2020 09:03, Tom Browder wrote: >Am I missing something? I thought every column has to have a type? Close, but no banana. Every value has a type. A column may contain multiple values (as in one per row). Therefore each of those values has a type, which may be different from the type of the value in the same column on another row. Columns have an affinity, which is a preference for the type of the value to be stored in that columns' rows if conversion from the value provided to be stored to the specified affinity is possible, lossless, and reversible. https://sqlite.org/datatype3.html -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
Am I missing something? I thought every column has to have a type? -Tom ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
Hi, I checked the download mentioned in the original email. Not sure if the table changed since the previous posts. It seems LUTFullString has 3 BLOB rows, but LENGTH treats them as strings. I'm in Melbourne, Oz, so I added the UTC datetime. regs, Kev kevin@KCYDell:~$ cd /mnt/KCY/KCYDocs/ kevin@KCYDell:/mnt/KCY/KCYDocs$ sqlite3 /mnt/KCY/KCYDocs/200k-per- row.sqlite SQLite version 3.30.1 2019-10-10 20:19:45 Enter ".help" for usage hints. sqlite> .schema CREATE TABLE copied( id_local INT, LUTFullString, LUTHash ); sqlite> sqlite> select '1',rowid,LENGTH(HEX(LUTFullString))/2 from copied ...> UNION ALL ...> select '2',rowid,INSTR(HEX(LUTFullString),'00') from copied ...> UNION ALL ...> select '3',rowid,substr(HEX(LUTFullString),0,10) from copied ...> UNION ALL ...> select '4',rowid,INSTR(SUBSTR(HEX(LUTFullString),4),'00') from copied ...> UNION ALL ...> select '5',rowid,LENGTH(LUTFullString) from copied; 1|1|194238 1|2|183050 1|3|193908 2|1|3 2|2|3 2|3|3 3|1|2C0003007 3|2|2C0003007 3|3|2C0003007 4|1|1 4|2|1 4|3|1 5|1|1 5|2|1 5|3|1 sqlite> .quit kevin@KCYDell:/mnt/KCY/KCYDocs$ date -u Sat 11 Jan 23:39:43 UTC 2020 kevin@KCYDell:/mnt/KCY/KCYDocs$ Message: 6 Date: Fri, 10 Jan 2020 08:48:21 -0500 From: Ryan Mack To: sqlite-users@mailinglists.sqlite.org Subject: [sqlite] Unexplained table bloat Message-ID: < CABhGdGRbR1kT+3_BU6ob9L7tpSPZ09HJn=ofPyK6OXvgQK=_...@mail.gmail.com> Content-Type: text/plain; charset="UTF-8" Hi list, I'm trying to understand unexplained table bloat I found in what should be a very small table in an old database file. If you dump/restore the database, the problem goes away. If you duplicate the table, the problem propagates. Schema: CREATE TABLE copied( id_local INT, LUTFullString, LUTHash ); Test Data: 85960605|,|0DE19F8AA100D2962FF22C60B3FA8940 85960465|,|A44615408E8B3E48A684C60CA2967031 85960451|,|33C8804354984071A740788AD308B279 In my test database file, each of these 3 rows is allocating about 40 4k overflow pages for a total database size of about 500k. The full database has about 4MB of actual data which takes up over 500MB on disk. If you want to see/reproduce the problem you'll need my test database file which I've uploaded here: https://mackman.net/200k-per-row.sqlite.zip (500kb download). I don't know why it doesn't compress better, those extra overflow pages must be getting filled with random garbage. My uninformed guess is there was a bug in the version of sqlite used at database creation time that computed an incorrect overflow threshold and is storing each byte of the row to its own page. Since the problem goes away with a dump/restore, I'm considering releasing a script to do that and mitigate the problem for affected users. Before doing that I would like to understand the problem better. Thanks for reading, Ryan PS: Here's some output from my debug session showing the 123 bytes of data is occupying 582k of space on disk in a freshly created table. % sqlite3 200k-per-row.sqlite SQLite version 3.28.0 2019-04-15 14:49:49 Enter ".help" for usage hints. sqlite> .tables copied sqlite> .schema copied CREATE TABLE copied( id_local INT, LUTFullString, LUTHash ); sqlite> select * from copied; 85960605|,|0DE19F8AA100D2962FF22C60B3FA8940 85960465|,|A44615408E8B3E48A684C60CA2967031 85960451|,|33C8804354984071A740788AD308B279 sqlite> select sum(length(id_local) + length(lutfullstring) + length(luthash)) from copied; 123 sqlite> create table copied2 as select * from copied; sqlite> SELECT name, SUM("pgsize") as sz FROM dbstat group by name; copied|581632 copied2|581632 sqlite_master|4096 ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 1/10/20 2:24 PM, Tim Streater wrote: On 10 Jan 2020, at 18:55, Keith Medcalf wrote: On Friday, 10 January, 2020 11:44, Tim Streater wrote: On 10 Jan 2020, at 18:03, Richard Hipp wrote: On 1/10/20, Dominique Devienne wrote: There's no way at all, to know the length of a text column with embedded NULLs? You can find the true length of a string in bytes from C-code using the sqlite3_column_bytes() interface. But I cannot, off-hand, think of a way to do that from SQL. But if I store UTF-8 in a TEXT column, surely I'm allowed to include NULLs in that? They are after all valid UTF-8 characters. No, they are not. The "NUL character" in Modified UTF-8 is the two-byte sequence 0xC0 0x80. This is specifically so that 0x00 can be used as a string terminator. Validly encoded UTF-8 encoded text stored in a C String (0x00 terminated sequence of bytes) must not contain an embedded 0x00 byte since that byte terminates the sequence. Nice, but Wikipedia has that as a "Derivative" and "incompatible with the UTF-8 specification and may be rejected by conforming UTF-8 applications." It appears (though I may have missed it) not to be mentioned on this handy site either: https://www.utf8-chartable.de/unicode-utf8-table.pl I shall have to check what my preferred language's wrapper does. It is incompatible, in the sense that it uses an encoding that the UTF-8 specification says in invalid, and thus an application that performs fully all the tests on valid data forms would reject it. In many ways it is a compatible extension in that excluding the test that specifically makes the form invalid, doing the processing by the general rules of UTF-8, gives the expected result. C Strings do not allow 0 bytes in them. This would normally mean that they do not allow the NUL character to be in a string. This extension allows a character which would be interpreted as the NUL character to be represented without needing a 0 byte. It should be pointed out that most libraries won't be checking all the strings that pass through them to see if they violate the rule, as that is just adding a lot of overhead for very little benefit. It is really expected that applications will do this sort of test at the borders, when possibly untrusted strings come in, and know that if good strings come in, the following processing will keep the strings valid. -- Richard Damon ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 10 Jan 2020, at 18:55, Keith Medcalf wrote: > On Friday, 10 January, 2020 11:44, Tim Streater wrote: > >>On 10 Jan 2020, at 18:03, Richard Hipp wrote: > >>> On 1/10/20, Dominique Devienne wrote: There's no way at all, to know the length of a text column with embedded NULLs? > >>> You can find the true length of a string in bytes from C-code using >>> the sqlite3_column_bytes() interface. But I cannot, off-hand, think >>> of a way to do that from SQL. > >>But if I store UTF-8 in a TEXT column, surely I'm allowed to include >>NULLs in that? They are after all valid UTF-8 characters. > > No, they are not. The "NUL character" in Modified UTF-8 is the two-byte > sequence 0xC0 0x80. This is specifically so that 0x00 can be used as a string > terminator. Validly encoded UTF-8 encoded text stored in a C String (0x00 > terminated sequence of bytes) must not contain an embedded 0x00 byte since > that byte terminates the sequence. Nice, but Wikipedia has that as a "Derivative" and "incompatible with the UTF-8 specification and may be rejected by conforming UTF-8 applications." It appears (though I may have missed it) not to be mentioned on this handy site either: https://www.utf8-chartable.de/unicode-utf8-table.pl I shall have to check what my preferred language's wrapper does. -- Cheers -- Tim ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 1/10/20 1:43 PM, Tim Streater wrote: On 10 Jan 2020, at 18:03, Richard Hipp wrote: On 1/10/20, Dominique Devienne wrote: There's no way at all, to know the length of a text column with embedded NULLs? You can find the true length of a string in bytes from C-code using the sqlite3_column_bytes() interface. But I cannot, off-hand, think of a way to do that from SQL. But if I store UTF-8 in a TEXT column, surely I'm allowed to include NULLs in that? They are after all valid UTF-8 characters. As has been said, C Strings (which is what TEXT is assumed to hold) are not allowed to include null characters, but are assumed to terminate at the first 0 bytes. There is a variant of UTF-8, called modified UTF-8 or MUTF-8, which allows a null character to be encoded as C0 80, which does decode to 0 by the base UTF-8 rules, but is disallowed by the minimum encoding rule, which can be used to embed nulls in strings if the system doesn't enforce the minimum length encoding rule (at least for this character). I have no idea if that would work with SQLite though. -- Richard Damon ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Friday, 10 January, 2020 11:44, Tim Streater wrote: >On 10 Jan 2020, at 18:03, Richard Hipp wrote: >> On 1/10/20, Dominique Devienne wrote: >>> There's no way at all, to know the length of a text column with >>> embedded NULLs? >> You can find the true length of a string in bytes from C-code using >> the sqlite3_column_bytes() interface. But I cannot, off-hand, think >> of a way to do that from SQL. >But if I store UTF-8 in a TEXT column, surely I'm allowed to include >NULLs in that? They are after all valid UTF-8 characters. No, they are not. The "NUL character" in Modified UTF-8 is the two-byte sequence 0xC0 0x80. This is specifically so that 0x00 can be used as a string terminator. Validly encoded UTF-8 encoded text stored in a C String (0x00 terminated sequence of bytes) must not contain an embedded 0x00 byte since that byte terminates the sequence. -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 10 Jan 2020, at 18:03, Richard Hipp wrote: > On 1/10/20, Dominique Devienne wrote: >> >> There's no way at all, to know the length of a text column with embedded >> NULLs? >> > > You can find the true length of a string in bytes from C-code using > the sqlite3_column_bytes() interface. But I cannot, off-hand, think > of a way to do that from SQL. But if I store UTF-8 in a TEXT column, surely I'm allowed to include NULLs in that? They are after all valid UTF-8 characters. -- Cheers -- Tim ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Friday, 10 January, 2020 10:50, Dominique Devienne : >On Fri, Jan 10, 2020 at 4:30 PM Richard Hipp wrote: >> length() on a BLOB should show the number of bytes in the BLOB. >> length() on a string should show the number of *characters* (not >> bytes) in the string up through but not including the first >> zero-character. It is possible to have additional content after the >> first zero-character in a string, which length() will not tell you >> about. > There's no way at all, to know the length of a text column with embedded > NULLs? C-Strings cannot have embedded nulls. The very definition of a C-String is "a sequence of non-zero characters followed by a zero character". So while you can store and retrieve invalid C-Strings in the database (as in use BIND and COLUMN_TEXT), "things" (meaning software) which expects a C-String to be a C-String will be confused by such improper usage of a C String, and "things" which expect "text" fields to contain properly encoded C-Strings are likely to have brain-seizures. You can, of course, cast the column as a blob (which IS allowed to have embedded nulls and DOES NOT have to have valid text encoding), and get the length of that bag-o-bytes (in bytes): sqlite> select * from copied; 85960605|,|0DE19F8AA100D2962FF22C60B3FA8940 85960465|,|A44615408E8B3E48A684C60CA2967031 85960451|,|33C8804354984071A740788AD308B279 sqlite> select typeof(id_local),typeof(lutfullstring),typeof(luthash) from copied; integer|text|text integer|text|text integer|text|text sqlite> select length(cast(lutfullstring as blob)) from copied; 194238 183050 193908 The first select is indeed selecting the ENTIRE string that has been asked for. It is then using printf "%s" to print it because it is SUPPOSED TO BE a null terminated string. It is doing exactly what it has been told to do. Similarly, the function LENGTH() on what is supposed to be a NULL terminated string returns the number of characters up to but not including the NULL terminator. The real problem here is that a BLOB has been stored as if it were TEXT. This is a failure of the application to properly sanitize its input. -- The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 1/10/20, Dominique Devienne wrote: > > There's no way at all, to know the length of a text column with embedded > NULLs? > You can find the true length of a string in bytes from C-code using the sqlite3_column_bytes() interface. But I cannot, off-hand, think of a way to do that from SQL. -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On Fri, Jan 10, 2020 at 4:30 PM Richard Hipp wrote: > length() on a BLOB should show the number of bytes in the BLOB. > > length() on a string should show the number of *characters* (not > bytes) in the string up through but not including the first > zero-character. It is possible to have additional content after the > first zero-character in a string, which length() will not tell you about. Hi Richard, There's no way at all, to know the length of a text column with embedded NULLs? --DD ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 10 Jan 2020, at 3:11pm, Ryan Mack wrote: > OK, I think I've got a better sense now. Hex encoding the column shows that > there's actually a huge amount of data stored in there. For some reason > length() isn't revealing it even if the column type is blob. Dumping and > restoring the table is truncating the data. Is there a chance that some part of the software thinks that 0x00 is a terminator for BLOBs, and some other part ignores the 0x00 and abides strictly by the length ? I've seen that before. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 1/10/20, Ryan Mack wrote: > OK, I think I've got a better sense now. Hex encoding the column shows > that there's actually a huge amount of data stored in there. For some > reason length() isn't revealing it even if the column type is blob. > Dumping and restoring the table is truncating the data. length() on a BLOB should show the number of bytes in the BLOB. length() on a string should show the number of *characters* (not bytes) in the string up through but not including the first zero-character. It is possible to have additional content after the first zero-character in a string, which length() will not tell you about. -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
OK, I think I've got a better sense now. Hex encoding the column shows that there's actually a huge amount of data stored in there. For some reason length() isn't revealing it even if the column type is blob. Dumping and restoring the table is truncating the data. On Fri, Jan 10, 2020 at 9:58 AM Ryan Mack wrote: > > Accidentally sent my first reply direct, responding to the list. I'm > now wondering if there's a lot of binary data hidden in each row. > Trying to figure out how to determine that if length() doesn't show > anything. > > > Prior response: > > An excellent idea, thank you :-) . The output is included below. > > I am continuing to do my own debugging in parallel. I am now leaning > towards a new hypothesis that there may be a bug handling > strange/malformed unicode that is resulting in a large amount of > garbage data being stored into the record. I'm trying to figure out > the exact conditions of copying/loading data into the table that > causes the issue to propagate or disappear. > > > % ./sqlite3_analyzer ~/Downloads/200k-per-row.sqlite > /** Disk-Space Utilization Report For > /Users/nadia/Downloads/200k-per-row.sqlite > > Page size in bytes 4096 > Pages in the whole file (measured) 143 > Pages in the whole file (calculated).. 143 > Pages that store data. 143100.0% > Pages on the freelist (per header) 00.0% > Pages on the freelist (calculated) 00.0% > Pages of auto-vacuum overhead. 00.0% > Number of tables in the database.. 2 > Number of indices. 0 > Number of defined indices. 0 > Number of implied indices. 0 > Size of the file in bytes. 585728 > Bytes of user payload stored.. 571322 97.5% > > *** Page counts for all tables with their indices > * > > COPIED 142 99.30% > SQLITE_MASTER. 10.70% > > *** Page counts for all tables and indices separately > * > > COPIED 142 99.30% > SQLITE_MASTER. 10.70% > > *** All tables > > > Percentage of total database.. 100.0% > Number of entries. 4 > Bytes of storage consumed. 585728 > Bytes of payload.. 571412 97.6% > Bytes of metadata. 1284 0.22% > Average payload per entry. 142853.00 > Average unused bytes per entry 3393.00 > Average metadata per entry 321.00 > Average fanout 3.00 > Maximum payload per entry. 194280 > Entries that use overflow. 3 75.0% > Index pages used.. 1 > Primary pages used 4 > Overflow pages used... 138 > Total pages used.. 143 > Unused bytes on index pages... 407099.37% > Unused bytes on primary pages. 950258.0% > Unused bytes on overflow pages 00.0% > Unused bytes on all pages. 135722.3% > > *** Table COPIED > ** > > Percentage of total database.. 99.30% > Number of entries. 3 > Bytes of storage consumed. 581632 > Bytes of payload.. 571322 98.2% > Bytes of metadata. 1172 0.20% > B-tree depth.. 2 > Average payload per entry. 190440.67 > Average unused bytes per entry 3226.00 > Average metadata per entry 390.67 > Average fanout 3.00 > Non-sequential pages.. 00.0% > Maximum payload per entry. 194280 > Entries that use overflow. 3 100.0% > Index pages used.. 1 > Primary pages used 3 > Overflow pages used... 138 > Total pages used.. 142 > Unused bytes on index pages... 407099.37% > Unused bytes on primary pages. 560845.6%
Re: [sqlite] Unexplained table bloat
Accidentally sent my first reply direct, responding to the list. I'm now wondering if there's a lot of binary data hidden in each row. Trying to figure out how to determine that if length() doesn't show anything. Prior response: An excellent idea, thank you :-) . The output is included below. I am continuing to do my own debugging in parallel. I am now leaning towards a new hypothesis that there may be a bug handling strange/malformed unicode that is resulting in a large amount of garbage data being stored into the record. I'm trying to figure out the exact conditions of copying/loading data into the table that causes the issue to propagate or disappear. % ./sqlite3_analyzer ~/Downloads/200k-per-row.sqlite /** Disk-Space Utilization Report For /Users/nadia/Downloads/200k-per-row.sqlite Page size in bytes 4096 Pages in the whole file (measured) 143 Pages in the whole file (calculated).. 143 Pages that store data. 143100.0% Pages on the freelist (per header) 00.0% Pages on the freelist (calculated) 00.0% Pages of auto-vacuum overhead. 00.0% Number of tables in the database.. 2 Number of indices. 0 Number of defined indices. 0 Number of implied indices. 0 Size of the file in bytes. 585728 Bytes of user payload stored.. 571322 97.5% *** Page counts for all tables with their indices * COPIED 142 99.30% SQLITE_MASTER. 10.70% *** Page counts for all tables and indices separately * COPIED 142 99.30% SQLITE_MASTER. 10.70% *** All tables Percentage of total database.. 100.0% Number of entries. 4 Bytes of storage consumed. 585728 Bytes of payload.. 571412 97.6% Bytes of metadata. 1284 0.22% Average payload per entry. 142853.00 Average unused bytes per entry 3393.00 Average metadata per entry 321.00 Average fanout 3.00 Maximum payload per entry. 194280 Entries that use overflow. 3 75.0% Index pages used.. 1 Primary pages used 4 Overflow pages used... 138 Total pages used.. 143 Unused bytes on index pages... 407099.37% Unused bytes on primary pages. 950258.0% Unused bytes on overflow pages 00.0% Unused bytes on all pages. 135722.3% *** Table COPIED ** Percentage of total database.. 99.30% Number of entries. 3 Bytes of storage consumed. 581632 Bytes of payload.. 571322 98.2% Bytes of metadata. 1172 0.20% B-tree depth.. 2 Average payload per entry. 190440.67 Average unused bytes per entry 3226.00 Average metadata per entry 390.67 Average fanout 3.00 Non-sequential pages.. 00.0% Maximum payload per entry. 194280 Entries that use overflow. 3 100.0% Index pages used.. 1 Primary pages used 3 Overflow pages used... 138 Total pages used.. 142 Unused bytes on index pages... 407099.37% Unused bytes on primary pages. 560845.6% Unused bytes on overflow pages 00.0% Unused bytes on all pages. 9678 1.7% *** Table SQLITE_MASTER *** Percentage of total database.. 0.70% Number of entries. 1 Bytes of storage consumed. 4096 Bytes of payload.. 90 2.2% Bytes of
Re: [sqlite] Unexplained table bloat
On 1/10/20, Ryan Mack wrote: > > I'm trying to understand unexplained table bloat The sqlite3_analyzer command-line utility program (available in the "Precompiled binaries" bundles on the https://sqlite.org/download.html page) is designed to help understand these kinds of problems. Please run that utility on the database and perhaps post the output here. -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
Thank you, I was unaware of the integrity_check pragma. It returns OK for the database in question. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
On 10 Jan 2020, at 2:06pm, David Raymond wrote: > Well something's weird anyway. When I open it with the command line tool it > queries it just fine Did you run an integrity_check on the database ? It looks from your posts as if it's corrupt. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unexplained table bloat
Well something's weird anyway. When I open it with the command line tool it queries it just fine. I tried to open it in Python to check all the characters in the strings and got this: Traceback (most recent call last): File "...\Testing4.py", line 8, in cur.execute("select * from copied;") sqlite3.OperationalError: Could not decode to UTF-8 column 'LUTFullString' with text ',' Which I suppose it just as likely to be my own problem though. -Original Message- From: sqlite-users On Behalf Of Ryan Mack Sent: Friday, January 10, 2020 8:48 AM To: sqlite-users@mailinglists.sqlite.org Subject: [sqlite] Unexplained table bloat Hi list, I'm trying to understand unexplained table bloat I found in what should be a very small table in an old database file. If you dump/restore the database, the problem goes away. If you duplicate the table, the problem propagates. Schema: CREATE TABLE copied( id_local INT, LUTFullString, LUTHash ); Test Data: 85960605|,|0DE19F8AA100D2962FF22C60B3FA8940 85960465|,|A44615408E8B3E48A684C60CA2967031 85960451|,|33C8804354984071A740788AD308B279 In my test database file, each of these 3 rows is allocating about 40 4k overflow pages for a total database size of about 500k. The full database has about 4MB of actual data which takes up over 500MB on disk. If you want to see/reproduce the problem you'll need my test database file which I've uploaded here: https://mackman.net/200k-per-row.sqlite.zip (500kb download). I don't know why it doesn't compress better, those extra overflow pages must be getting filled with random garbage. My uninformed guess is there was a bug in the version of sqlite used at database creation time that computed an incorrect overflow threshold and is storing each byte of the row to its own page. Since the problem goes away with a dump/restore, I'm considering releasing a script to do that and mitigate the problem for affected users. Before doing that I would like to understand the problem better. Thanks for reading, Ryan PS: Here's some output from my debug session showing the 123 bytes of data is occupying 582k of space on disk in a freshly created table. % sqlite3 200k-per-row.sqlite SQLite version 3.28.0 2019-04-15 14:49:49 Enter ".help" for usage hints. sqlite> .tables copied sqlite> .schema copied CREATE TABLE copied( id_local INT, LUTFullString, LUTHash ); sqlite> select * from copied; 85960605|,|0DE19F8AA100D2962FF22C60B3FA8940 85960465|,|A44615408E8B3E48A684C60CA2967031 85960451|,|33C8804354984071A740788AD308B279 sqlite> select sum(length(id_local) + length(lutfullstring) + length(luthash)) from copied; 123 sqlite> create table copied2 as select * from copied; sqlite> SELECT name, SUM("pgsize") as sz FROM dbstat group by name; copied|581632 copied2|581632 sqlite_master|4096 ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Unexplained table bloat
Hi list, I'm trying to understand unexplained table bloat I found in what should be a very small table in an old database file. If you dump/restore the database, the problem goes away. If you duplicate the table, the problem propagates. Schema: CREATE TABLE copied( id_local INT, LUTFullString, LUTHash ); Test Data: 85960605|,|0DE19F8AA100D2962FF22C60B3FA8940 85960465|,|A44615408E8B3E48A684C60CA2967031 85960451|,|33C8804354984071A740788AD308B279 In my test database file, each of these 3 rows is allocating about 40 4k overflow pages for a total database size of about 500k. The full database has about 4MB of actual data which takes up over 500MB on disk. If you want to see/reproduce the problem you'll need my test database file which I've uploaded here: https://mackman.net/200k-per-row.sqlite.zip (500kb download). I don't know why it doesn't compress better, those extra overflow pages must be getting filled with random garbage. My uninformed guess is there was a bug in the version of sqlite used at database creation time that computed an incorrect overflow threshold and is storing each byte of the row to its own page. Since the problem goes away with a dump/restore, I'm considering releasing a script to do that and mitigate the problem for affected users. Before doing that I would like to understand the problem better. Thanks for reading, Ryan PS: Here's some output from my debug session showing the 123 bytes of data is occupying 582k of space on disk in a freshly created table. % sqlite3 200k-per-row.sqlite SQLite version 3.28.0 2019-04-15 14:49:49 Enter ".help" for usage hints. sqlite> .tables copied sqlite> .schema copied CREATE TABLE copied( id_local INT, LUTFullString, LUTHash ); sqlite> select * from copied; 85960605|,|0DE19F8AA100D2962FF22C60B3FA8940 85960465|,|A44615408E8B3E48A684C60CA2967031 85960451|,|33C8804354984071A740788AD308B279 sqlite> select sum(length(id_local) + length(lutfullstring) + length(luthash)) from copied; 123 sqlite> create table copied2 as select * from copied; sqlite> SELECT name, SUM("pgsize") as sz FROM dbstat group by name; copied|581632 copied2|581632 sqlite_master|4096 ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users