Re: [sqlite] more efficient JSON encoding: idle musing
> > > I experimented with a number of similar ideas for storing JSON when I > was first designing the JSON components for SQLite. I was never able > to find anything that was as fast or as compact as just storing the > original JSON text. > I've also done a lot of experiments and was surprised at how little a binary encoding saves in space. Also tried with lookups for keys, but the lookup values quickly become close to the size of the keys (if not larger) if keys are mostly shortish. I'd be happy with a JSON5-like ability to have the quotes on keys optional if they contain no spaces and no special characters. Seems to reduce the data size quite significantly. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
> On Feb 25, 2020, at 6:12 AM, J Decker wrote: > > other than that; if space is really a concern, maybe a zip layer? In my experience, the concern is more about speed than size. Given the raw string/blob data from a SQLite column, and a specific property name/path, how fast can you find its value, convert it to a format SQLite's query engine understand, and return it? The JSON1 extension's parser seems pretty darn fast, but given the nature of JSON, it has to do a bunch of scanning that's O(n) with the length of the data. It's generally impossible to find a value in the middle of a JSON string without examining every single byte that came before it. The way to go faster is to use formats that let you jump through the structure faster. For example, tokenizing dictionary keys and encoding an object's keys as a fixed-width sorted list lets you look up a key in O(log n) time, and the constant factor is very small because you're comparing integers not strings. I don't know if extracting 'classes' will help much in a query. The data will be smaller, and it makes it extremely fast to look up values if you know the class ahead of time, but in a query you don't know the class. Compared to what I described above, there's an extra step where you have to look up the class description from the object. I also worry about use cases where the number of 'classes' becomes unwieldy, because the schema might be using a huge set of keys. For example, something where a customer order contains an object that maps SKUs to quantities, like {"A73563M1": 3, "A73522M0": 7, …}. And there are tens of thousands of SKUs. —Jens ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
On Fri, Feb 21, 2020 at 6:03 AM Richard Hipp wrote: > On 2/21/20, Wout Mertens wrote: > > The idea is that upon storing the JSON > > data, the JSON1 extension parses it, extracts the layouts recursively, > > stores them when they are not known yet, and then only stores the > > values in the binary format with the layout identifiers. > > I experimented with a number of similar ideas for storing JSON when I > was first designing the JSON components for SQLite. I was never able > to find anything that was as fast or as compact as just storing the > original JSON text. But I could have overlooked something. If you > have example code for a mechanism that is more space efficient and/or > faster, please share it with us. > text is as long as text is, and numbers, for small ranges, are also compressed to 2 bytes (one for a separator, or opener, and 1 for the value) gets you 0-9 (0-64 if you base64 encode it)... looking at just the data part of JSON. You end up with a lot of overhead from the repeated field name definition. I created a format https://github.com/d3x0r/jsox#jsox--javascript-object-exchange-format that is compatible with existing JSON, but adds the ability to specify 'class' definitions. There's a specification of the grammar in bnf format, and pictures... It tracks the current parsing state, 0, initial being called 'unknown'. If a string is found in an unknown state, followed by an object, then that defines a class type... 'record{id,title,author,data}' then after than, a second occurrence in the unknown state, or within an array or object context, '[record{1342,"book","person",1}]' would use the existing list of names in order with the values, and build an object that was { id:1342,title:"book",author:"person",data:1 }. 'record' could be shortened to any single unicode character, otherwise the saving isn't so great. The definition of 'string' is sort of loose in JSOX, as long as there isn't a format control character ( whitespace, ':', '{', '}', '[', ']' ) you don't need quotes around a sequence of characters to make a string; excepting of course starting with characters that look like a number, and/or match a keyword... The triggering of the mode is '{' after a string, or while collecting a string I also extended the number format of JSON to allow specifying ISO-8601 times as numbers (just have to special case in addition to '.'; ':' 'T' 'Z' '-' (inline and not just at start)). other than that; if space is really a concern, maybe a zip layer? J > -- > D. Richard Hipp > d...@sqlite.org > ___ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
On 21.02.2020 15:03 Richard Hipp wrote: > On 2/21/20, Wout Mertens wrote: >> The idea is that upon storing the JSON >> data, the JSON1 extension parses it, extracts the layouts recursively, >> stores them when they are not known yet, and then only stores the >> values in the binary format with the layout identifiers. > I experimented with a number of similar ideas for storing JSON when I > was first designing the JSON components for SQLite. I was never able > to find anything that was as fast or as compact as just storing the > original JSON text. But I could have overlooked something. If you > have example code for a mechanism that is more space efficient and/or > faster, please share it with us. If you want to be as space efficient as possible, look into succinct data structures. The balanced parenthesis representation of a tree can be stored in a bit vector (each node 2 bits) and there are (succinct) index structures to query that efficiently. See e.g. https://core.ac.uk/download/pdf/81941172.pdf and https://github.com/simongog/sdsl-lite/blob/master/include/sdsl/bp_support_g.hpp as an implementation. Making this work with JSON would be a lot of work, though, I guess. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
> On Feb 21, 2020, at 4:20 AM, Wout Mertens wrote: > > I was wondering if the JSON extension could not do the same thing: for each > table, keep a hidden stash of object layouts, and store the values as > sqlite primitives. (you'd be able to disable this, in case the layouts > rarely repeat) We do something a bit like this in Couchbase Lite[1]. We don't directly store JSON, we store a binary data format we designed called Fleece, which has the same semantics but is faster to traverse and more compact. https://github.com/couchbaselabs/fleece In Fleece a JSON object is represented as an array of fixed-width {key, value} pairs. - The keys are typically 16-bit integers that map into a global (per-database) table of common key strings. - The values are either inline if they fit (small ints, boolean, null) or else offsets to where the real data is. The array is sorted by key, so lookup is a binary search, usually on 16-bit ints. Numbers are stored in a compact variable-width encoding. Strings are unescaped UTF-8, and multiple occurrences of the same string are only stored once. (I've recently discovered that this is similar to FlexBuffers. Fleece has been around since 2015, so I don't know if the designers of FlexBuffers looked at it…) I've thought about the 'layouts' concept but haven't tried implementing it yet. (Most JavaScript runtimes use something like it, btw.) It would only save about 4*n bytes per record, where n is the average number of keys in a layout, and a small number of clock cycles per key lookup. > On Feb 21, 2020, at 6:03 AM, Richard Hipp wrote: > > I experimented with a number of similar ideas for storing JSON when I > was first designing the JSON components for SQLite. I was never able > to find anything that was as fast or as compact as just storing the > original JSON text. I had similar experiences with BSON, Bencode, etc. In my use case I found that the major expense wasn't parsing, rather allocating an object tree. So I designed the Fleece format to be fully useable in-place without requiring parsing. A secondary factor is that traversing objects and arrays is expensive because the items are variable width, so you have to parse through all the (potentially nested) items before the one you're looking for. That's why Fleece objects and arrays are fixed-width, using offsets to point to variable-size values. —Jens [1] https://www.couchbase.com/products/mobile ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
> On Feb 21, 2020, at 4:20 AM, Wout Mertens wrote: > > In JavaScript, objects are key-value collections with unique keys, where the > order of the keys is important. JSON is not JavaScript. The order of keys is NOT significant in JSON, and many, many JSON implementations parse JSON objects into data structures* that do not preserve key order. Clients using such implementations cannot determine the original order of keys, nor can they re-serialize the object with the same ordering. I have run into multiple issues over time with systems that assume JSON key ordering is significant and preserved (CouchDB and Scuttlebutt come to mind), which then later have to redesign protocols or data schema because of interoperability problems this causes. —Jens * The platform's standard "dictionary" / "map" class is typically implemented as a hash table or binary tree, as in Java, .NET, Objective-C, Swift, Go, Lua, C++. (I'm unsure about Python and Ruby.) A few platforms besides JS keep an auxiliary key array to preserve ordering. Erlang uses a linked-list, which is ordered. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
If you have example code for a mechanism that is more space efficient and/or faster, please share it with us. "Bencode" is approximately the same space-wise as JSON, but encoding/decoding is potentially faster since it doesn't have to do any escaping for strings: https://en.wikipedia.org/wiki/Bencode On Fri, 21 Feb 2020, Richard Hipp wrote: On 2/21/20, Wout Mertens wrote: The idea is that upon storing the JSON data, the JSON1 extension parses it, extracts the layouts recursively, stores them when they are not known yet, and then only stores the values in the binary format with the layout identifiers. I experimented with a number of similar ideas for storing JSON when I was first designing the JSON components for SQLite. I was never able to find anything that was as fast or as compact as just storing the original JSON text. But I could have overlooked something. If you have example code for a mechanism that is more space efficient and/or faster, please share it with us. -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
On Fri, Feb 21, 2020 at 3:03 PM Richard Hipp wrote: > If you > have example code for a mechanism that is more space efficient and/or > faster, please share it with us. I'll see if I can prototype something in JS - I'd be keeping the layouts in a helper table, and I wouldn't be storing the values in binary but it'd be a starting point. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
On 2/21/20, Wout Mertens wrote: > The idea is that upon storing the JSON > data, the JSON1 extension parses it, extracts the layouts recursively, > stores them when they are not known yet, and then only stores the > values in the binary format with the layout identifiers. I experimented with a number of similar ideas for storing JSON when I was first designing the JSON components for SQLite. I was never able to find anything that was as fast or as compact as just storing the original JSON text. But I could have overlooked something. If you have example code for a mechanism that is more space efficient and/or faster, please share it with us. -- D. Richard Hipp d...@sqlite.org ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
On Fri, Feb 21, 2020 at 2:37 PM Warren Young wrote: > > On Feb 21, 2020, at 5:20 AM, Wout Mertens wrote: > > Queries can go faster, because a query like `where json_extract(json, > > '$.foo') = 'bar'` can first check the layouts to see which ones apply, > > SQLite’s JSON1 extension is a storage and query mechanism, not a run-time > object system. I se that things like json_remove() exist, but my assumption > is that 99.manynines% of the time, objects are stored, retrieved, and queried > without being modified at the SQLite level, if at all. > > Therefore, 99.manynines% of the time, there is only one “layout.” Hmm, that is not what I meant. The idea is that upon storing the JSON data, the JSON1 extension parses it, extracts the layouts recursively, stores them when they are not known yet, and then only stores the values in the binary format with the layout identifiers. So yes, somewhat more work than storing and retrieving a plain string. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] more efficient JSON encoding: idle musing
On Feb 21, 2020, at 5:20 AM, Wout Mertens wrote: > > In JavaScript, objects are key-value collections with unique keys, where > the order of the keys is important. ECMAScript §13.7.5.15 (2019 edition) says, "The mechanics and order of enumerating the properties is not specified but must conform to the rules specified below.” You can read the rules if you like, but it doesn’t say that every JavaScript give the same ordering: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-enumerate-object-properties The JSON spec then says, “The JSON syntax...does not assign any significance to the ordering of name/value pairs." http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf > Most JSVMs I’m going to take a wild guess that “Most” here means “a whole bunch of different browsers and server-side JS stacks all using V8.” > Queries can go faster, because a query like `where json_extract(json, > '$.foo') = 'bar'` can first check the layouts to see which ones apply, SQLite’s JSON1 extension is a storage and query mechanism, not a run-time object system. I se that things like json_remove() exist, but my assumption is that 99.manynines% of the time, objects are stored, retrieved, and queried without being modified at the SQLite level, if at all. Therefore, 99.manynines% of the time, there is only one “layout.” To the extent that that is true, I don’t see how the rest of your proposal matters. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users