Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-26 Thread Paul van Helden
>
>
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.
>

 I've also done a lot of experiments and was surprised at how little a
binary encoding saves in space. Also tried with lookups for keys, but the
lookup values quickly become close to the size of the keys (if not larger)
if keys are mostly shortish.

I'd be happy with a JSON5-like ability to have the quotes on keys optional
if they contain no spaces and no special characters. Seems to reduce the
data size quite significantly.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-25 Thread Jens Alfke


> On Feb 25, 2020, at 6:12 AM, J Decker  wrote:
> 
> other than that; if space is really a concern, maybe a zip layer?

In my experience, the concern is more about speed than size. Given the raw 
string/blob data from a SQLite column, and a specific property name/path, how 
fast can you find its value, convert it to a format SQLite's query engine 
understand, and return it?

The JSON1 extension's parser seems pretty darn fast, but given the nature of 
JSON, it has to do a bunch of scanning that's O(n) with the length of the data. 
It's generally impossible to find a value in the middle of a JSON string 
without examining every single byte that came before it.

The way to go faster is to use formats that let you jump through the structure 
faster. For example, tokenizing dictionary keys and encoding an object's keys 
as a fixed-width sorted list lets you look up a key in O(log n) time, and the 
constant factor is very small because you're comparing integers not strings.

I don't know if extracting 'classes' will help much in a query. The data will 
be smaller, and it makes it extremely fast to look up values if you know the 
class ahead of time, but in a query you don't know the class. Compared to what 
I described above, there's an extra step where you have to look up the class 
description from the object.

I also worry about use cases where the number of 'classes' becomes unwieldy, 
because the schema might be using a huge set of keys. For example, something 
where a customer order contains an object that maps SKUs to quantities, like 
{"A73563M1": 3, "A73522M0": 7, …}. And there are tens of thousands of SKUs.

—Jens
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-25 Thread J Decker
On Fri, Feb 21, 2020 at 6:03 AM Richard Hipp  wrote:

> On 2/21/20, Wout Mertens  wrote:
> > The idea is that upon storing the JSON
> > data, the JSON1 extension parses it, extracts the layouts recursively,
> > stores them when they are not known yet, and then only stores the
> > values in the binary format with the layout identifiers.
>
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.  But I could have overlooked something.  If you
> have example code for a mechanism that is more space efficient and/or
> faster, please share it with us.
>

text is as long as text is, and numbers, for small ranges, are also
compressed to 2 bytes (one for a separator, or opener, and 1 for the value)
gets you 0-9 (0-64 if you base64 encode it)... looking at just the data
part of JSON.  You end up with a lot of overhead from the repeated field
name definition.

I created a format
https://github.com/d3x0r/jsox#jsox--javascript-object-exchange-format that
is compatible with existing JSON, but adds the ability to specify 'class'
definitions.  There's a specification of the grammar in bnf format, and
pictures... It tracks the current parsing state, 0, initial being called
'unknown'.  If a string is found in an unknown state, followed by an
object, then that defines a class type... 'record{id,title,author,data}'
then after than, a second occurrence in the unknown state, or within an
array or object context, '[record{1342,"book","person",1}]'  would use the
existing list of names in order with the values, and build an object that
was { id:1342,title:"book",author:"person",data:1 }.   'record' could be
shortened to any single unicode character, otherwise the saving isn't so
great.
The definition of 'string' is sort of loose in JSOX, as long as there isn't
a format control character ( whitespace, ':', '{', '}', '[', ']' ) you
don't need quotes around a sequence of characters to make a string;
excepting of course starting with characters that look like a number,
and/or match a keyword...
The triggering of the mode is '{' after a string, or while collecting a
string

I also extended the number format of JSON to allow specifying ISO-8601
times as numbers (just have to special case in addition to '.'; ':' 'T'
'Z' '-' (inline and not just at start)).

other than that; if space is really a concern, maybe a zip layer?

J


> --
> D. Richard Hipp
> d...@sqlite.org
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Martin Raiber
On 21.02.2020 15:03 Richard Hipp wrote:
> On 2/21/20, Wout Mertens  wrote:
>> The idea is that upon storing the JSON
>> data, the JSON1 extension parses it, extracts the layouts recursively,
>> stores them when they are not known yet, and then only stores the
>> values in the binary format with the layout identifiers.
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.  But I could have overlooked something.  If you
> have example code for a mechanism that is more space efficient and/or
> faster, please share it with us.

If you want to be as space efficient as possible, look into succinct
data structures. The balanced parenthesis representation of a tree can
be stored in a bit vector (each node 2 bits) and there are (succinct)
index structures to query that efficiently.

See e.g. https://core.ac.uk/download/pdf/81941172.pdf and
https://github.com/simongog/sdsl-lite/blob/master/include/sdsl/bp_support_g.hpp
as an implementation. Making this work with JSON would be a lot of work,
though, I guess.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Jens Alfke


> On Feb 21, 2020, at 4:20 AM, Wout Mertens  wrote:
> 
> I was wondering if the JSON extension could not do the same thing: for each
> table, keep a hidden stash of object layouts, and store the values as
> sqlite primitives. (you'd be able to disable this, in case the layouts
> rarely repeat)

We do something a bit like this in Couchbase Lite[1]. We don't directly store 
JSON, we store a binary data format we designed called Fleece, which has the 
same semantics but is faster to traverse and more compact.

https://github.com/couchbaselabs/fleece

In Fleece a JSON object is represented as an array of fixed-width {key, value} 
pairs.
- The keys are typically 16-bit integers that map into a global (per-database) 
table of common key strings.
- The values are either inline if they fit (small ints, boolean, null) or else 
offsets to where the real data is.
The array is sorted by key, so lookup is a binary search, usually on 16-bit 
ints.

Numbers are stored in a compact variable-width encoding. Strings are unescaped 
UTF-8, and multiple occurrences of the same string are only stored once.

(I've recently discovered that this is similar to FlexBuffers. Fleece has been 
around since 2015, so I don't know if the designers of FlexBuffers looked at 
it…)

I've thought about the 'layouts' concept but haven't tried implementing it yet. 
(Most JavaScript runtimes use something like it, btw.) It would only save about 
4*n bytes per record, where n is the average number of keys in a layout, and a 
small number of clock cycles per key lookup.

> On Feb 21, 2020, at 6:03 AM, Richard Hipp  wrote:
> 
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.

I had similar experiences with BSON, Bencode, etc. In my use case I found that 
the major expense wasn't parsing, rather allocating an object tree. So I 
designed the Fleece format to be fully useable in-place without requiring 
parsing. A secondary factor is that traversing objects and arrays is expensive 
because the items are variable width, so you have to parse through all the 
(potentially nested) items before the one you're looking for. That's why Fleece 
objects and arrays are fixed-width, using offsets to point to variable-size 
values.

—Jens

[1] https://www.couchbase.com/products/mobile
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Jens Alfke


> On Feb 21, 2020, at 4:20 AM, Wout Mertens  wrote:
> 
> In JavaScript, objects are key-value collections with unique keys, where the 
> order of the keys is important.

JSON is not JavaScript. The order of keys is NOT significant in JSON, and many, 
many JSON implementations parse JSON objects into data structures* that do not 
preserve key order. Clients using such implementations cannot determine the 
original order of keys, nor can they re-serialize the object with the same 
ordering.

I have run into multiple issues over time with systems that assume JSON key 
ordering is significant and preserved (CouchDB and Scuttlebutt come to mind), 
which then later have to redesign protocols or data schema because of 
interoperability problems this causes.

—Jens

* The platform's standard "dictionary" / "map" class is typically implemented 
as a hash table or binary tree, as in Java, .NET, Objective-C, Swift, Go, Lua, 
C++. (I'm unsure about Python and Ruby.) A few platforms besides JS keep an 
auxiliary key array to preserve ordering. Erlang uses a linked-list, which is 
ordered.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Carl Edquist
If you have example code for a mechanism that is more space efficient 
and/or faster, please share it with us.


"Bencode" is approximately the same space-wise as JSON, but 
encoding/decoding is potentially faster since it doesn't have to do any 
escaping for strings:


https://en.wikipedia.org/wiki/Bencode


On Fri, 21 Feb 2020, Richard Hipp wrote:


On 2/21/20, Wout Mertens  wrote:

The idea is that upon storing the JSON
data, the JSON1 extension parses it, extracts the layouts recursively,
stores them when they are not known yet, and then only stores the
values in the binary format with the layout identifiers.


I experimented with a number of similar ideas for storing JSON when I
was first designing the JSON components for SQLite.  I was never able
to find anything that was as fast or as compact as just storing the
original JSON text.  But I could have overlooked something.  If you
have example code for a mechanism that is more space efficient and/or
faster, please share it with us.

--
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Wout Mertens
On Fri, Feb 21, 2020 at 3:03 PM Richard Hipp  wrote:
> If you
> have example code for a mechanism that is more space efficient and/or
> faster, please share it with us.

I'll see if I can prototype something in JS - I'd be keeping the
layouts in a helper table, and I wouldn't be storing the values in
binary but it'd be a starting point.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Richard Hipp
On 2/21/20, Wout Mertens  wrote:
> The idea is that upon storing the JSON
> data, the JSON1 extension parses it, extracts the layouts recursively,
> stores them when they are not known yet, and then only stores the
> values in the binary format with the layout identifiers.

I experimented with a number of similar ideas for storing JSON when I
was first designing the JSON components for SQLite.  I was never able
to find anything that was as fast or as compact as just storing the
original JSON text.  But I could have overlooked something.  If you
have example code for a mechanism that is more space efficient and/or
faster, please share it with us.

-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Wout Mertens
On Fri, Feb 21, 2020 at 2:37 PM Warren Young  wrote:
>
> On Feb 21, 2020, at 5:20 AM, Wout Mertens  wrote:
> > Queries can go faster, because a query like `where json_extract(json,
> > '$.foo') = 'bar'` can first check the layouts to see which ones apply,
>
> SQLite’s JSON1 extension is a storage and query mechanism, not a run-time 
> object system. I se that things like json_remove() exist, but my assumption 
> is that 99.manynines% of the time, objects are stored, retrieved, and queried 
> without being modified at the SQLite level, if at all.
>
> Therefore, 99.manynines% of the time, there is only one “layout.”

Hmm, that is not what I meant. The idea is that upon storing the JSON
data, the JSON1 extension parses it, extracts the layouts recursively,
stores them when they are not known yet, and then only stores the
values in the binary format with the layout identifiers.

So yes, somewhat more work than storing and retrieving a plain string.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Warren Young
On Feb 21, 2020, at 5:20 AM, Wout Mertens  wrote:
> 
> In JavaScript, objects are key-value collections with unique keys, where
> the order of the keys is important.

ECMAScript §13.7.5.15 (2019 edition) says, "The mechanics and order of 
enumerating the properties is not specified but must conform to the rules 
specified below.”

You can read the rules if you like, but it doesn’t say that every JavaScript 
give the same ordering:


https://www.ecma-international.org/ecma-262/10.0/index.html#sec-enumerate-object-properties

The JSON spec then says, “The JSON syntax...does not assign any significance to 
the ordering of name/value pairs."

http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

> Most JSVMs

I’m going to take a wild guess that “Most” here means “a whole bunch of 
different browsers and server-side JS stacks all using V8.”

> Queries can go faster, because a query like `where json_extract(json,
> '$.foo') = 'bar'` can first check the layouts to see which ones apply,

SQLite’s JSON1 extension is a storage and query mechanism, not a run-time 
object system. I se that things like json_remove() exist, but my assumption is 
that 99.manynines% of the time, objects are stored, retrieved, and queried 
without being modified at the SQLite level, if at all.

Therefore, 99.manynines% of the time, there is only one “layout.”

To the extent that that is true, I don’t see how the rest of your proposal 
matters.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] more efficient JSON encoding: idle musing

2020-02-21 Thread Wout Mertens
Hi,

I use SQLite as a MaybeSQL store, mixing fixed columns with schemaless JSON
columns. It's really great.

In JavaScript, objects are key-value collections with unique keys, where
the order of the keys is important. Most JSVMs store them as a pointer to a
layout and then the values. The layout lists the keys in order (and
possibly the types, to get byte-perfect layouts). If you delete a key from
an object, it generates a new layout.

I was wondering if the JSON extension could not do the same thing: for each
table, keep a hidden stash of object layouts, and store the values as
sqlite primitives. (you'd be able to disable this, in case the layouts
rarely repeat)

This way:

   - it's more space efficient, since they keys are only stored once per
   layout
   - loading is faster, because the values are stored as their primitive
   type
   - querying can go faster

Queries can go faster, because a query like `where json_extract(json,
'$.foo') = 'bar'` can first check the layouts to see which ones apply,
allowing to skip other layouts, and then quickly find the value to test
thanks to the binary encoding

You could also allow an optimization that makes key order unimportant,
reducing the number of layouts.

So, smaller and faster. Thoughts?

Wout.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users