Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts

2019-02-04 Thread Adam Kocoloski
I overlooked two important details in the earlier email below. I try not to make 
a habit of replying to my own emails, but here we go ...

First, in order to update a document while reading only the winning edit branch 
entry (the second property that I said I wanted to provide), we also need to 
include the versionstamp associated with the current leaf in the edit branch 
KV, so that we know which entry to clear from the “by_seq” subspace. A more 
complete example would be:

(“_meta”, DocID, IsDeleted, RevPosition, RevHash) = (VersionstampForRev, 
[ParentRev, GrandparentRev, …])
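
To make the fast path concrete, here is a rough sketch using the Python fdb 
bindings. The subspace names, tuple layout, and helper below are illustrative 
assumptions, not a settled design:

import fdb

fdb.api_version(600)  # assumption: a 6.x client; use whatever version is installed

# Assumed layout, following the example above:
#   ("_meta", docid, is_deleted, rev_pos, rev_hash) = (versionstamp_for_rev, parent_rev, ...)
#   ("by_seq", versionstamp)                        = (docid, rev)

@fdb.transactional
def fast_path_update(tr, docid, old_meta_key, new_meta_key, new_rev, parent_revs):
    # Read only the winning branch entry; its value starts with the versionstamp
    # of the current leaf, which tells us exactly which by_seq row to clear.
    old_versionstamp = fdb.tuple.unpack(tr[old_meta_key])[0]
    tr.clear(fdb.tuple.pack(("by_seq", old_versionstamp)))
    tr.clear(old_meta_key)

    # Write the new by_seq entry keyed on the commit versionstamp, and record
    # that same versionstamp at the front of the new branch entry's value.
    tr.set_versionstamped_key(
        fdb.tuple.pack_with_versionstamp(("by_seq", fdb.tuple.Versionstamp())),
        fdb.tuple.pack((docid, new_rev)))
    tr.set_versionstamped_value(
        new_meta_key,
        fdb.tuple.pack_with_versionstamp(
            (fdb.tuple.Versionstamp(),) + tuple(parent_revs)))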

Second ... that property isn't really compatible with the compact edit conflict 
storage model I proposed, is it? If you need to merge a new edit into an 
existing set of KVs, you can't very well just blindly clear the whole range.

This is an interesting balancing act. Perhaps there's a good way for us to 
preserve the “fast path” that applies in 99% of circumstances and then only 
fall back to the “read the doc in order to update it” mode when we know there 
are multiple edit branches.

Adam

> On Feb 4, 2019, at 6:22 PM, Adam Kocoloski  wrote:
> 
> I think it’s fine to start a focused discussion here as it might help inform 
> some of the broader debate over in that thread.
> 
> As a reminder, today CouchDB writes the entire body of each document revision 
> on disk as a separate blob. Edit conflicts that have common fields between 
> them do not share any storage on disk. The revision tree is encoded into a 
> compact format and a copy of it is stored directly in both the by_id tree and 
> the by_seq tree. Each leaf entry in the revision tree contains a pointer to 
> the position of the associated doc revision on disk.
> 
> As a further reminder, CouchDB 2.x clusters can generate edit conflict 
> revisions just from multiple clients concurrently updating the same document 
> in a single cluster. This won’t happen when FoundationDB is running under the 
> hood, but users who deploy multiple CouchDB or PouchDB servers and replicate 
> between them can of course still produce conflicts just like they could in 
> CouchDB 1.x, so we need a solution.
> 
> Let’s consider the two sub-topics separately: 1) storage of edit conflict 
> bodies and 2) revision trees
> 
> ## Edit Conflict Storage
> 
> The simplest possible solution would be to store each document revision 
> separately, like we do today. We could store document bodies with (“docid”, 
> “revid”) as the key prefix, and each transaction could clear the key range 
> associated with the base revision against which the edit is being attempted. 
> This would work, but I think we can try to be a bit more clever and save on 
> storage space given that we’re splitting JSON documents into multiple KV 
> pairs.
> 
> One thought I’d had is to introduce a special enum Value which indicates that 
> the subtree “beneath” the given Key is in conflict. For example, consider the 
> documents
> 
> {
>“_id”: “foo”,
>“_rev”: “1-abc”,
>“owner”: “alice”,
>“active”: true
> }
> 
> and 
> 
> {
>“_id”: “foo”,
>“_rev”: “1-def”,
>“owner”: “bob”,
>“active”: true
> }
> 
> We could represent these using the following set of KVs:
> 
> (“foo”, “active”) = true
> (“foo”, “owner”) = kCONFLICT
> (“foo”, “owner”, “1-abc”) = “alice”
> (“foo”, “owner”, “1-def”) = “bob”
> 
> This approach also extends to conflicts where the two versions have different 
> data types. Consider a more complicated example where bob dropped the 
> “active” field and changed the “owner” field to an object:
> 
> {
>  “_id”: “foo”,
>  “_rev”: “1-def”,
>  “owner”: {
>“name”: “bob”,
>“email”: “b...@example.com"
>  }
> }
> 
> Now the set of KVs for “foo” looks like this (note that a missing field needs 
> to be handled explicitly):
> 
> (“foo”, “active”) = kCONFLICT
> (“foo”, “active”, “1-abc”) = true
> (“foo”, “active”, “1-def”) = kMISSING
> (“foo”, “owner”) = kCONFLICT
> (“foo”, “owner”, “1-abc”) = “alice”
> (“foo”, “owner”, “1-def”, “name”) = “bob”
> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com”
> 
> I like this approach for the common case where documents share most of their 
> data but have a conflict in a specific field or set of fields. 
> 
> I’ve encountered one important downside, though: an edit that replicates in 
> and conflicts with the entire document can cause a bit of a data explosion. 
> Consider a case where I have 10 conflicting versions of a 100KB document, but 
> the conflicts are all related to a single scalar value. Now I replicate in an 
> empty document, and suddenly I have a kCONFLICT at the root. In this model I 
> now need to list out every path of every one of the 10 existing revisions and 
> I end up with a 1MB update. Yuck. That’s technically no worse in the end 
> state than the “zero sharing” case above, but one could easily imagine 
> overrunning the transaction size limit this way.
> 
> I suspect there’s a smart path out of this. 

Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts

2019-02-04 Thread Adam Kocoloski
I think it’s fine to start a focused discussion here as it might help inform 
some of the broader debate over in that thread.

As a reminder, today CouchDB writes the entire body of each document revision 
on disk as a separate blob. Edit conflicts that have common fields between them 
do not share any storage on disk. The revision tree is encoded into a compact 
format and a copy of it is stored directly in both the by_id tree and the 
by_seq tree. Each leaf entry in the revision tree contains a pointer to the 
position of the associated doc revision on disk.

As a further reminder, CouchDB 2.x clusters can generate edit conflict 
revisions just from multiple clients concurrently updating the same document in 
a single cluster. This won’t happen when FoundationDB is running under the 
hood, but users who deploy multiple CouchDB or PouchDB servers and replicate 
between them can of course still produce conflicts just like they could in 
CouchDB 1.x, so we need a solution.

Let’s consider the two sub-topics separately: 1) storage of edit conflict 
bodies and 2) revision trees

## Edit Conflict Storage

The simplest possible solution would be to store each document revision 
separately, like we do today. We could store document bodies with (“docid”, 
“revid”) as the key prefix, and each transaction could clear the key range 
associated with the base revision against which the edit is being attempted. 
This would work, but I think we can try to be a bit more clever and save on 
storage space given that we’re splitting JSON documents into multiple KV pairs.

One thought I’d had is to introduce a special enum Value which indicates that 
the subtree “beneath” the given Key is in conflict. For example, consider the 
documents

{
“_id”: “foo”,
“_rev”: “1-abc”,
“owner”: “alice”,
“active”: true
}

and 

{
“_id”: “foo”,
“_rev”: “1-def”,
“owner”: “bob”,
“active”: true
}

We could represent these using the following set of KVs:

(“foo”, “active”) = true
(“foo”, “owner”) = kCONFLICT
(“foo”, “owner”, “1-abc”) = “alice”
(“foo”, “owner”, “1-def”) = “bob”

This approach also extends to conflicts where the two versions have different 
data types. Consider a more complicated example where bob dropped the “active” 
field and changed the “owner” field to an object:

{
  “_id”: “foo”,
  “_rev”: “1-def”,
  “owner”: {
“name”: “bob”,
“email”: “b...@example.com"
  }
}

Now the set of KVs for “foo” looks like this (note that a missing field needs 
to be handled explicitly):

(“foo”, “active”) = kCONFLICT
(“foo”, “active”, “1-abc”) = true
(“foo”, “active”, “1-def”) = kMISSING
(“foo”, “owner”) = kCONFLICT
(“foo”, “owner”, “1-abc”) = “alice”
(“foo”, “owner”, “1-def”, “name”) = “bob”
(“foo”, “owner”, “1-def”, “email”) = “b...@example.com”

I like this approach for the common case where documents share most of their 
data but have a conflict in a specific field or set of fields. 
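
As a concrete illustration of the encoding (setting aside the data-explosion 
issue discussed just below), here is a toy sketch in Python. The kCONFLICT / 
kMISSING sentinels, the recursion strategy, and the skipping of "_"-prefixed 
fields are illustrative assumptions; arrays and the actual key encoding are 
ignored:

KCONFLICT = "kCONFLICT"  # stand-ins for the special enum values
KMISSING = "kMISSING"

def encode_doc(docid, revs):
    # revs maps rev id -> JSON body (Python dicts / scalars). Returns a list
    # of (key_tuple, value) pairs in the style described above.
    kvs = []

    def emit_plain(prefix, value):
        # Flatten one revision's subtree with no sharing.
        if isinstance(value, dict):
            for field, v in sorted(value.items()):
                emit_plain(prefix + (field,), v)
        else:
            kvs.append((prefix, value))

    def walk(prefix, per_rev):
        # per_rev: rev id -> value at this path (KMISSING if the field is absent)
        values = list(per_rev.values())
        if all(isinstance(v, dict) for v in values):
            for field in sorted({f for d in values for f in d}):
                walk(prefix + (field,),
                     {r: d.get(field, KMISSING) for r, d in per_rev.items()})
        elif all(v == values[0] for v in values):
            kvs.append((prefix, values[0]))       # shared by all revisions
        else:
            kvs.append((prefix, KCONFLICT))       # conflict marker at subtree root
            for rev, v in sorted(per_rev.items()):
                if isinstance(v, dict):
                    emit_plain(prefix + (rev,), v)
                else:
                    kvs.append((prefix + (rev,), v))

    bodies = {rev: {k: v for k, v in body.items() if not k.startswith("_")}
              for rev, body in revs.items()}
    walk((docid,), bodies)
    return kvs

# encode_doc("foo", {"1-abc": {...}, "1-def": {...}}) with the two bodies above
# reproduces the KV listing shown earlier.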

I’ve encountered one important downside, though: an edit that replicates in and 
conflicts with the entire document can cause a bit of a data explosion. 
Consider a case where I have 10 conflicting versions of a 100KB document, but 
the conflicts are all related to a single scalar value. Now I replicate in an 
empty document, and suddenly I have a kCONFLICT at the root. In this model I 
now need to list out every path of every one of the 10 existing revisions and I 
end up with a 1MB update. Yuck. That’s technically no worse in the end state 
than the “zero sharing” case above, but one could easily imagine overrunning 
the transaction size limit this way.

I suspect there’s a smart path out of this. Maybe the system detects a 
“default” value for each field and uses that instead of writing out the value 
for every revision in a conflicted subtree. Worth some discussion.

## Revision Trees

In CouchDB we currently represent revisions as a hash history tree; each 
revision identifier is derived from the content of the revision including the 
revision identifier of its parent. Individual edit branches are bounded in 
*length* (I believe the default is 1000 entries), but the number of edit 
branches is technically unbounded.

The size limits in FoundationDB preclude us from storing the entire key tree as 
a single value; in pathological situations the tree could exceed 100KB. Rather, 
I think it would make sense to store each edit *branch* as a separate KV. We 
stem the branch long before it hits the value size limit, and in the happy case 
of no edit conflicts this means we store the edit history metadata in a single 
KV. It also means that we can apply an interactive edit without retrieving the 
entire conflicted revision tree; we need only retrieve and modify the single 
branch against which the edit is being applied. The downside is that we 
duplicate historical revision identifiers shared by multiple edit branches, but 
I think this is a worthwhile tradeoff.

I would furthermore try to structure the keys so 

Re: [DISCUSS] : things we need to solve/decide : changes feed

2019-02-04 Thread Adam Kocoloski
Probably good to take a quick step back and note that FoundationDB’s 
versionstamps are an elegant and scalable solution to atomically maintaining 
the index of documents in the order in which they were most recently updated. I 
think that’s what you mean by the first part of the problem, but I want to make 
sure that on the ML here we collectively understand that FoundationDB actually 
nails this hard part of the problem *really* well.

When you say “notify CouchDB about new updates”, are you referring to the 
feed=longpoll or feed=continuous options to the _changes API? I guess I see  
three different routes that can be taken here.

One route is to use the same kind of machinery that we have in place today in 
CouchDB 2.x. As a reminder, the way this works is

-  a client waiting for changes on a DB spawns one local process and also a 
rexi RPC process on each node hosting one of the DB shards of interest (see 
fabric_db_update_listener). 
- those RPC processes register as local couch_event listeners, where they 
receive {db_updated, ShardName} messages forwarded to them from the 
couch_db_updater processes.

Of course, in the FoundationDB design we don’t need to serialize updates in 
couch_db_updater processes, but individual writers could just as easily fire 
off those db_updated messages. This design is already heavily optimized for 
large numbers of listeners on large numbers of databases. The downside that I 
can see is that it means the *CouchDB layer nodes would need to form a 
distributed Erlang cluster* in order for them to learn about the changes being 
committed from other nodes in the cluster.

So let’s say we *didn’t* want to do that and we rather are trying to design for 
completely independent layer nodes that have no knowledge of or communication 
with one another save through FoundationDB. There’s definitely something to be 
said for that constraint. One very simple approach might be to just poll 
FoundationDB. If the database is under a heavy write load there’s no overhead 
to this approach; every time we finish sending the output of one range query 
against the versionstamp space and we re-acquire a new read version there will 
be new updates to consume. Where it gets inefficient is if we have a lot of 
listeners on the feed and a very low-throughput database. But one could fiddle 
with polling intervals, or else add a layer of indirection so that only one 
process on each layer node does the polling and then sends events to 
couch_event. I think this could scale quite far.
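
To make that polling option concrete, here is a rough sketch of a per-node 
poller using the Python fdb bindings; the subspace layout, batch size, and 
broadcast() hook are placeholders rather than a settled design:

import time
import fdb

fdb.api_version(600)
db = fdb.open()

@fdb.transactional
def changes_batch(tr, seq_prefix, after_key, limit=1000):
    # Materialize inside the transaction; start strictly after the last key seen.
    begin = fdb.KeySelector.first_greater_than(after_key)
    end = seq_prefix + b'\xff'
    return [(kv.key, kv.value) for kv in tr.get_range(begin, end, limit=limit)]

def poll_loop(seq_prefix, broadcast, interval=0.1):
    # One process per layer node runs this loop and re-broadcasts to local
    # listeners (couch_event in the real system) via the broadcast() callback.
    last_key = seq_prefix
    while True:
        rows = changes_batch(db, seq_prefix, last_key)
        for key, value in rows:
            broadcast(key, value)
            last_key = key
        if not rows:
            time.sleep(interval)  # idle database: back off instead of spinning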

The third option (which I think is the one you're homing in on) is to leverage 
FoundationDB's watches to get a push notification about updates to a particular 
key. I would be cautious about creating a key or set of keys specifically for 
this purpose, but if we find that there's some other bit of metadata that we 
need to maintain anyway then this could work nicely. I think the same 
indirection that I described above (where each layer node has a maximum of one 
watcher per database, and it re-broadcasts messages to all interested clients 
via couch_event) would help us not be too constrained by the limit on watches.

So to recap, the three approaches:

1. Writers publish db_updated events to couch_event, listeners use distributed 
Erlang to subscribe to all nodes
2. Poll the _changes subspace, scale by nominating a specific process per node 
to do the polling
3. Same as #2 but using a watch on DB metadata that changes with every update 
instead of polling

Adam

> On Feb 4, 2019, at 2:18 PM, Ilya Khlopotov  wrote:
> 
> One of the features of CouchDB, which doesn't map cleanly into FoundationDB is 
> changes feed. The essence of the feature is: 
> - Subscriber of the feed wants to receive notifications when database is 
> updated. 
> - The notification includes update_seq for the database and list of changes 
> which happen at that time. 
> - The change itself includes docid and rev. 
> Hi, 
> 
> There are multiple ways to easily solve this problem. Designing a scalable 
> way to do it is way harder.  
> 
> There are at least two parts to this problem:
> - how to structure secondary indexes so we can provide what we need in 
> notification event
> - how to notify CouchDB about new updates
> 
> For the second part of the problem we could setup a watcher on one of the 
> keys we have to update on every transaction. For example the key which tracks 
> the database_size or key which tracks the number of documents or we can add 
> our own key. The problem is at some point we would hit a capacity limit for 
> atomic updates of a single key (FoundationDB doesn't redistribute the load 
> among servers on per key basis). In such case we would have to distribute the 
> counter among multiple keys to allow FoundationDB to split the hot range. 
> Therefore, we would have to setup multiple watches. FoundationDB has a limit 
> on the number of watches the client can setup (10 watches). So we need to 
> keep in mind this 

Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

2019-02-04 Thread Robert Newson
I've been remiss here in not posting the data model ideas that IBM worked up 
while we were thinking about using FoundationDB, so I'm posting them now. This 
is Adam Kocoloski's original work, I am just transcribing it, and this is the 
context that the folks from the IBM side came in with, for full disclosure.

Basics

1. All CouchDB databases are inside a Directory
2. Each CouchDB database is a Directory within that Directory
3. It's possible to list all subdirectories of a Directory, so `_all_dbs` is 
the list of subdirectories of the Directory from 1.
4. Each Directory representing a CouchDB database has several Subspaces:
4a. by_id/ doc subspace: actual document contents 
4b. by_seq/versionstamp subspace: for the _changes feed 
4c. index_definitions, indexes, ...

JSON Mapping

A hierarchical JSON object naturally maps to multiple KV pairs in FDB:

{ 
“_id”: “foo”, 
“owner”: “bob”, 
“mylist”: [1,3,5], 
“mymap”: { 
“blue”: “#FF”, 
“red”: “#FF” 
} 
}

maps to

(“foo”, “owner”) = “bob” 
(“foo”, “mylist”, 0) = 1 
(“foo”, “mylist”, 1) = 3 
(“foo”, “mylist”, 2) = 5 
(“foo”, “mymap”, “blue”) = “#FF” 
(“foo”, “mymap”, “red”) = “#FF”

NB: this means that the 100KB value limit applies to individual leaves in the 
JSON object, not the entire doc
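
As a tiny illustration of that mapping, here is a sketch in Python (reserved 
fields and the actual tuple/value encoding are glossed over):

def flatten(docid, body):
    # Yields (key_tuple, value) pairs for one JSON document, mirroring the
    # example above. Purely illustrative.
    def walk(prefix, node):
        if isinstance(node, dict):
            for field, value in node.items():
                if not field.startswith("_"):
                    yield from walk(prefix + (field,), value)
        elif isinstance(node, list):
            for index, value in enumerate(node):
                yield from walk(prefix + (index,), value)
        else:
            yield (prefix, node)
    yield from walk((docid,), body)

# flatten("foo", {"_id": "foo", "owner": "bob", "mylist": [1, 3, 5]}) yields
# (("foo", "owner"), "bob"), (("foo", "mylist", 0), 1), and so on.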

Edit Conflicts

We need to account for the presence of conflicts in various levels of the doc 
due to replication.

Proposal is to create a special value indicating that the subtree below our 
current cursor position is in an unresolvable conflict. Then add additional KV 
pairs below to describe the conflicting entries.

KV data model allows us to store these efficiently and minimize duplication of 
data:

A document with these two conflicts:

{ 
“_id”: “foo”, 
“_rev”: “1-abc”, 
“owner”: “alice”, 
“active”: true 
}
{ 
“_id”: “foo”, 
“_rev”: “1-def”, 
“owner”: “bob”, 
“active”: true 
}

could be stored thus:

(“foo”, “active”) = true 
(“foo”, “owner”) = kCONFLICT 
(“foo”, “owner”, “1-abc”) = “alice” 
(“foo”, “owner”, “1-def”) = “bob”

So long as `kCONFLICT` is set at the top of the conflicting subtree this 
representation can handle conflicts of different data types as well.

Missing fields need to be handled explicitly:

{ 
  “_id”: “foo”, 
  “_rev”: “1-abc”, 
  “owner”: “alice”, 
  “active”: true 
}

{ 
  “_id”: “foo”, 
  “_rev”: “1-def”, 
  “owner”: { 
“name”: “bob”, 
“email”: “b...@example.com” 
  } 
}

could be stored thus:

(“foo”, “active”) = kCONFLICT 
(“foo”, “active”, “1-abc”) = true 
(“foo”, “active”, “1-def”) = kMISSING 
(“foo”, “owner”) = kCONFLICT 
(“foo”, “owner”, “1-abc”) = “alice” 
(“foo”, “owner”, “1-def”, “name”) = “bob” 
(“foo”, “owner”, “1-def”, “email”) = ...

Revision Metadata

* CouchDB uses a hash history for revisions 
** Each edit is identified by the hash of the content of the edit including the 
base revision against which it was applied 
** Individual edit branches are bounded in length but the number of branches is 
potentially unbounded 

* Size limits preclude us from storing the entire key tree as a single value; 
in pathological situations 
the tree could exceed 100KB (each entry is > 16 bytes) 

* Store each edit branch as a separate KV including deleted status in a special 
subspace 

* Structure key representation so that “winning” revision can be automatically 
retrieved in a limit=1 
key range operation

(“foo”, “_meta”, “deleted=false”, 1, “def”) = [] 
(“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”, ”2-bar”, ”1-foo”]  <-- winner
(“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
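
A sketch of that limit=1 winner lookup in the Python bindings. It assumes the 
deleted flag is encoded so that non-deleted branches sort after deleted ones 
(e.g. 1 for live, 0 for deleted); the actual encoding is still to be decided:

import fdb

fdb.api_version(600)
db = fdb.open()

@fdb.transactional
def winning_branch(tr, docid):
    # The last key in the (docid, "_meta") range is the winner: highest
    # (not_deleted, rev_pos, rev_hash) in tuple order.
    prefix = fdb.tuple.pack((docid, "_meta"))
    for kv in tr.get_range_startswith(prefix, limit=1, reverse=True):
        _docid, _meta, not_deleted, rev_pos, rev_hash = fdb.tuple.unpack(kv.key)
        ancestors = list(fdb.tuple.unpack(kv.value))
        return rev_pos, rev_hash, not_deleted == 0, ancestors  # deleted flag last
    return None  # document does not exist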

Changes Feed

* FDB supports a concept called a versionstamp — a 10 byte, unique, 
monotonically (but not sequentially) increasing value for each committed 
transaction. The first 8 bytes are the committed version of the database. The 
last 2 bytes are monotonic in the serialization order for transactions. 

* A transaction can specify a particular index into a key where the following 
10 bytes will be overwritten by the versionstamp at commit time 

* A subspace keyed on versionstamp naturally yields a _changes feed

by_seq subspace 
  (“versionstamp1”) = (“foo”, “1-abc”) 
  (“versionstamp4”) = (“bar”, “4-def”) 

by_id subspace 
  (“bar”, “_vsn”) = “versionstamp4” 
  ... 
  (“foo”, “_vsn”) = “versionstamp1”
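
A sketch of the corresponding write path, assuming the versionstamp helpers in 
the Python bindings; the subspace names and the “_vsn” bookkeeping key mirror 
the example above but are not a settled layout:

import fdb

fdb.api_version(600)
db = fdb.open()

@fdb.transactional
def record_update(tr, docid, rev):
    # A real writer would first clear the doc's previous by_seq entry, found
    # via the versionstamp stored under ("by_id", docid, "_vsn").

    # by_seq: the incomplete versionstamp in the key is filled in at commit time.
    seq_key = fdb.tuple.pack_with_versionstamp(("by_seq", fdb.tuple.Versionstamp()))
    tr.set_versionstamped_key(seq_key, fdb.tuple.pack((docid, rev)))

    # by_id: remember which versionstamp this document now lives under, so the
    # next update (or a reader) can find its by_seq entry directly.
    tr.set_versionstamped_value(
        fdb.tuple.pack(("by_id", docid, "_vsn")),
        fdb.tuple.pack_with_versionstamp((fdb.tuple.Versionstamp(),)))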

JSON Indexes

* “Mango” JSON indexes are defined by
** a list of field names, each of which may be nested,  
** an optional partial_filter_selector which constrains the set of docs that 
contribute 
** an optional name defined by the ddoc field (the name is auto-generated if 
not supplied) 

* Store index definitions in a single subspace to aid query planning 
** ((person,name), title, email) = (“name-title-email”, “{“student”: true}”) 
** Store the values for each index in a dedicated subspace, adding the document 
ID as the last element in the tuple 
*** (“rosie revere”, “engineer”, “ro...@example.com", “foo”) = null
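
A toy sketch of emitting a Mango index row for one document under that layout; 
the dotted-path field extraction and the treatment of missing fields are 
simplifying assumptions:

def index_rows(index_fields, doc):
    # Returns [(key_tuple, value)] for the index subspace: the indexed values
    # in order, with the document ID appended as the last tuple element.
    def lookup(node, dotted):
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                return None  # this doc does not contribute to the index
            node = node[part]
        return node

    values = [lookup(doc, field) for field in index_fields]
    if any(v is None for v in values):
        return []
    return [(tuple(values) + (doc["_id"],), None)]

# index_rows(["name", "title", "email"],
#            {"_id": "foo", "name": "rosie revere", "title": "engineer",
#             "email": "ro...@example.com"})
# -> [(("rosie revere", "engineer", "ro...@example.com", "foo"), None)]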

B.

-- 
  Robert Samuel Newson
  

Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts

2019-02-04 Thread Robert Newson
This one is quite tightly coupled to the other thread on the data model; should 
we start much conversation here before that one gets closer to a solution?

-- 
  Robert Samuel Newson
  rnew...@apache.org

On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote:
> This is a beginning of a discussion thread about storage of edit 
> conflicts and everything which relates to revisions.
> 
> 


Re: [DISCUSS] : things we need to solve/decide : changes feed

2019-02-04 Thread Robert Newson
Let's not conflate two things here.

The changes feed is just the documents in the database, including deleted ones, 
in the order they were last written. This is simple to do in FoundationDB and 
Adam already showed the solution using fdb's versionstamps.

The talk of 'watcher' and 'subscriber' is about the continuous mode of the 
changes feed, and this is a stateful part of the current CouchDB system (as it 
mediates the writes to the db). We know the watcher facility in fdb will not 
scale to this usage.

B.

-- 
  Robert Samuel Newson
  rnew...@apache.org

On Mon, 4 Feb 2019, at 19:18, Ilya Khlopotov wrote:
> One of the features of CouchDB, which doesn't map cleanly into 
> FoundationDB is changes feed. The essence of the feature is: 
> - Subscriber of the feed wants to receive notifications when database is 
> updated. 
> - The notification includes update_seq for the database and list of 
> changes which happen at that time. 
> - The change itself includes docid and rev. 
> Hi, 
> 
> There are multiple ways to easily solve this problem. Designing a 
> scalable way to do it is way harder.  
> 
> There are at least two parts to this problem:
> - how to structure secondary indexes so we can provide what we need in 
> notification event
> - how to notify CouchDB about new updates
> 
> For the second part of the problem we could setup a watcher on one of 
> the keys we have to update on every transaction. For example the key 
> which tracks the database_size or key which tracks the number of 
> documents or we can add our own key. The problem is at some point we 
> would hit a capacity limit for atomic updates of a single key 
> (FoundationDB doesn't redistribute the load among servers on per key 
> basis). In such case we would have to distribute the counter among 
> multiple keys to allow FoundationDB to split the hot range. Therefore, 
> we would have to setup multiple watches. FoundationDB has a limit on the 
> number of watches the client can setup (10 watches). So we need to 
> keep in mind this number when designing the feature. 
> 
> The single key update rate problem is very theoretical and we might 
> ignore it for the PoC version. Then we can measure the impact and change 
> design accordingly. The reason I decided to bring it up is to see maybe 
> someone has a simple solution to avoid the bottleneck. 
> 
> best regards,
> iilyak


[DISCUSS] : things we need to solve/decide : changes feed

2019-02-04 Thread Ilya Khlopotov
One of the features of CouchDB which doesn't map cleanly onto FoundationDB is 
the changes feed. The essence of the feature is: 
- The subscriber of the feed wants to receive notifications when the database 
is updated. 
- The notification includes the update_seq for the database and the list of 
changes which happened at that time. 
- The change itself includes the docid and rev. 
Hi, 

There are multiple ways to easily solve this problem. Designing a scalable way 
to do it is way harder.  

There are at least two parts to this problem:
- how to structure secondary indexes so we can provide what we need in 
notification event
- how to notify CouchDB about new updates

For the second part of the problem we could set up a watch on one of the keys 
we have to update on every transaction: for example, the key which tracks the 
database_size, the key which tracks the number of documents, or a key we add 
ourselves. The problem is that at some point we would hit a capacity limit for 
atomic updates of a single key (FoundationDB doesn't redistribute the load 
among servers on a per-key basis). In that case we would have to distribute the 
counter among multiple keys to allow FoundationDB to split the hot range. 
Therefore, we would have to set up multiple watches. FoundationDB has a limit 
on the number of watches a client can set up (10 watches). So we need to keep 
this number in mind when designing the feature. 
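
As a sketch of the "distribute the counter among multiple keys" idea, using the 
Python fdb bindings (the shard count, key names, and the decision to watch 
every shard are illustrative assumptions):

import os
import struct
import fdb

fdb.api_version(600)
db = fdb.open()

N_SHARDS = 8  # must stay within the per-client watch budget

@fdb.transactional
def bump_update_counter(tr, db_prefix):
    # Writers pick a random shard and bump it with an atomic add, so no single
    # key becomes a hot spot and writers never conflict with each other.
    shard = os.urandom(1)[0] % N_SHARDS
    tr.add(fdb.tuple.pack((db_prefix, "update_counter", shard)),
           struct.pack("<q", 1))

@fdb.transactional
def watch_update_counters(tr, db_prefix):
    # One future per shard; any of them firing means "new updates to read".
    return [tr.watch(fdb.tuple.pack((db_prefix, "update_counter", s)))
            for s in range(N_SHARDS)]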

The single-key update rate problem is largely theoretical and we might ignore 
it for the PoC version. Then we can measure the impact and change the design 
accordingly. The reason I decided to bring it up is to see whether someone has 
a simple solution to avoid the bottleneck. 

best regards,
iilyak


Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

2019-02-04 Thread Ilya Khlopotov


I want to fix previous mistakes. I made two mistakes in the previous calculations:
- I used 1Kb as the base size for calculating the expansion factor (although we 
don't know the exact size of the original document)
- The expansion factor calculation included the number of revisions (it 
shouldn't have)

I'll focus on the flattened JSON docs model.

The following formula was used in the previous calculation:

storage_size_per_document = mapping_table_size * number_of_revisions
  + depth * number_of_paths * number_of_revisions
  + number_of_paths * value_size * number_of_revisions

To clarify things a little bit, I want to calculate the space requirement for a 
single revision this time:

mapping_table_size = number_of_field_names * (field_name_length + 4 (integer size))
  = 100 * (20 + 4) = 2400 bytes

storage_size_per_document_per_revision_per_replica
  = mapping_table_size + depth * number_of_paths + value_size * number_of_paths
  = 2400 bytes + 10 * 1000 + 100 * 1000 = 112400 bytes ~= 110 Kb
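
For what it's worth, a quick Python check of that per-revision estimate, using 
the assumptions from the earlier message (100 field names of 20 bytes, 1000 
paths of depth 10, 100-byte values):

field_names, field_name_len = 100, 20
paths, depth, value_size = 1000, 10, 100

mapping_table = field_names * (field_name_len + 4)   # 2400 bytes
keys = depth * paths                                 # 10000 bytes
values = value_size * paths                          # 100000 bytes
print(mapping_table + keys + values)                 # 112400 bytes ~= 110 Kb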

We can definitely reduce the mapping table requirement by adopting rnewson's 
idea of a schema.

On 2019/02/04 11:08:16, Ilya Khlopotov  wrote: 
> Hi Michael,
> 
> > For example, hears a crazy thought:
> > Map every distinct occurence of a key/value instance through a crypto hash
> > function to get a set of hashes.
> >
> > These can be be precomputed by Couch without any lookups in FDB.  These
> > will be spread all over kingdom come in FDB and not lend themselves to
> > range search well.
> > 
> > So what you do is index them for frequency of occurring in the same set.
> > In essence, you 'bucket them' statistically, and that bucket id becomes a
> > key prefix. A crypto hash value can be copied into more than one bucket.
> > The {bucket_id}/{cryptohash} becomes a {val_id}
> 
> > When writing a document, Couch submits the list/array of cryptohash values
> > it computed to FDB and gets back the corresponding  {val_id} (the id with
> > the bucket prefixed).  This can get somewhat expensive if there's always a
> > lot of app local cache misses.
> >
> > A document's value is then a series of {val_id} arrays up to 100k per
> > segment.
> > 
> > When retrieving a document, you get the val_ids, find the distinct buckets
> > and min/max entries for this doc, and then parallel query each bucket while
> > reconstructing the document.
> 
> Interesting idea. Let's try to think it through to see if we can make it 
> viable. 
> Let's go through hypothetical example. Input data for the example:
> - 1M of documents
> - each document is around 10Kb
> - each document consists of 1K of unique JSON paths 
> - each document has 100 unique JSON field names
> - every scalar value is 100 bytes
> - 10% of unique JSON paths for every document already stored in database 
> under different doc or different revision of the current one
> - we assume 3 independent copies for every key-value pair in FDB
> - our hash key size is 32 bytes
> - let's assume we can determine if key is already on the storage without 
> doing query
> - 1% of paths is in cache (unrealistic value, in real live the percentage is 
> lower)
> - every JSON field name is 20 bytes
> - every JSON path is 10 levels deep
> - document key prefix length is 50
> - every document has 10 revisions
> Let's estimate the storage requirements and size of data we need to transmit. 
> The calculations are not exact.
> 1. storage_size_per_document (we cannot estimate exact numbers since we don't 
> know how FDB stores it)
>   - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 
> 3 = 1140 Kb (11x)
> 2. number of independent keys to retrieve on document read (non-range 
> queries) per document
>   - 1K - (1K * 1%) = 990
> 3. number of range queries: 0
> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 
> Kb (10x) 
> 5. read latency (we use 2ms per read based on numbers from 
> https://apple.github.io/foundationdb/performance.html)
> - sequential: 990*2ms = 1980ms 
> - range: 0
> Let's compare these numbers with initial proposal (flattened JSON docs 
> without global schema and without cache)
> 1. storage_size_per_document
>   - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>   - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes 
>   - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 
> Kb * 3 = 5930 Kb (59.3x)
> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> 3. number of range queries: 1 (1001 of keys in result)
> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)  
> 5. read latency (we use 2ms per read based on numbers from 
> https://apple.github.io/foundationdb/performance.html and estimate range read 
> performance based on numbers from 
> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>   - range read performance: Given read performance is about 305,000 
> reads/second and range performance 3,600,000 keys/second we estimate range 
> performance to be 11.8x compared to read 

Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

2019-02-04 Thread Robert Newson
I think we're deep in the weeds on this small aspect of the data model problem, 
and haven't touched other aspects yet. The numbers used in your example (1k of 
paths, 100 unique field names, 100 bytes for a value), where are they from? If 
they are not from some empirical data source, I don't see any reason to dwell 
on anything we might infer from them.

I think we should focus on the simplest model that also 'works' (i.e, delivers 
all essential properties) and then prototype so we can see how efficient it is.

I am happy to sacrifice some degree of efficiency for a comprehensible mapping 
of documents to key-value pairs and we have at least three techniques to 
address long keys so far. We also know there are other approaches to this 
problem if necessary that have a much smaller storage overhead (adjacent rows 
of 100k chunks of the couchdb document treated as a blob). 

-- 
  Robert Samuel Newson
  rnew...@apache.org

On Mon, 4 Feb 2019, at 18:29, Ilya Khlopotov wrote:
> At some point I changed the number of unique JSON paths and probably 
> forgot to update other conditions.
> The ` - each document is around 10Kb` is not used in the calculations so 
> can be ignored.
> 
> On 2019/02/04 17:46:20, Adam Kocoloski  wrote: 
> > Ugh! We definitely cannot have a model where a 10K JSON document is 
> > exploded into 2MB worth of KV data. I’ve tried several times to follow the 
> > math here but I’m failing. I can’t even get past this first bit:
> > 
> > > - each document is around 10Kb
> > > - each document consists of 1K of unique JSON paths 
> > > - each document has 100 unique JSON field names
> > > - every scalar value is 100 bytes
> > 
> > If each document has 1000 paths, and each path (which leads to a unique 
> > scalar value, right?) has a value of 100 bytes associated with it … how is 
> > the document 10KB? Wouldn’t it need to be at least 100KB just by adding up 
> > all the scalar values?
> > 
> > Adam
> > 
> > > On Feb 4, 2019, at 6:08 AM, Ilya Khlopotov  wrote:
> > > 
> > > Hi Michael,
> > > 
> > >> For example, hears a crazy thought:
> > >> Map every distinct occurence of a key/value instance through a crypto 
> > >> hash
> > >> function to get a set of hashes.
> > >> 
> > >> These can be be precomputed by Couch without any lookups in FDB.  These
> > >> will be spread all over kingdom come in FDB and not lend themselves to
> > >> range search well.
> > >> 
> > >> So what you do is index them for frequency of occurring in the same set.
> > >> In essence, you 'bucket them' statistically, and that bucket id becomes a
> > >> key prefix. A crypto hash value can be copied into more than one bucket.
> > >> The {bucket_id}/{cryptohash} becomes a {val_id}
> > > 
> > >> When writing a document, Couch submits the list/array of cryptohash 
> > >> values
> > >> it computed to FDB and gets back the corresponding  {val_id} (the id with
> > >> the bucket prefixed).  This can get somewhat expensive if there's always 
> > >> a
> > >> lot of app local cache misses.
> > >> 
> > >> A document's value is then a series of {val_id} arrays up to 100k per
> > >> segment.
> > >> 
> > >> When retrieving a document, you get the val_ids, find the distinct 
> > >> buckets
> > >> and min/max entries for this doc, and then parallel query each bucket 
> > >> while
> > >> reconstructing the document.
> > > 
> > > Interesting idea. Let's try to think it through to see if we can make it 
> > > viable. 
> > > Let's go through hypothetical example. Input data for the example:
> > > - 1M of documents
> > > - each document is around 10Kb
> > > - each document consists of 1K of unique JSON paths 
> > > - each document has 100 unique JSON field names
> > > - every scalar value is 100 bytes
> > > - 10% of unique JSON paths for every document already stored in database 
> > > under different doc or different revision of the current one
> > > - we assume 3 independent copies for every key-value pair in FDB
> > > - our hash key size is 32 bytes
> > > - let's assume we can determine if key is already on the storage without 
> > > doing query
> > > - 1% of paths is in cache (unrealistic value, in real live the percentage 
> > > is lower)
> > > - every JSON field name is 20 bytes
> > > - every JSON path is 10 levels deep
> > > - document key prefix length is 50
> > > - every document has 10 revisions
> > > Let's estimate the storage requirements and size of data we need to 
> > > transmit. The calculations are not exact.
> > > 1. storage_size_per_document (we cannot estimate exact numbers since we 
> > > don't know how FDB stores it)
> > >  - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 
> > > 10 * 3 = 1140 Kb (11x)
> > > 2. number of independent keys to retrieve on document read (non-range 
> > > queries) per document
> > >  - 1K - (1K * 1%) = 990
> > > 3. number of range queries: 0
> > > 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 
> > > 102 Kb (10x) 
> > > 5. read latency (we use 2ms 

Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

2019-02-04 Thread Ilya Khlopotov
At some point I changed the number of unique JSON paths and probably forgot to 
update the other conditions.
The ` - each document is around 10Kb` line is not used in the calculations, so 
it can be ignored.

On 2019/02/04 17:46:20, Adam Kocoloski  wrote: 
> Ugh! We definitely cannot have a model where a 10K JSON document is exploded 
> into 2MB worth of KV data. I’ve tried several times to follow the math here 
> but I’m failing. I can’t even get past this first bit:
> 
> > - each document is around 10Kb
> > - each document consists of 1K of unique JSON paths 
> > - each document has 100 unique JSON field names
> > - every scalar value is 100 bytes
> 
> If each document has 1000 paths, and each path (which leads to a unique 
> scalar value, right?) has a value of 100 bytes associated with it … how is 
> the document 10KB? Wouldn’t it need to be at least 100KB just by adding up 
> all the scalar values?
> 
> Adam
> 
> > On Feb 4, 2019, at 6:08 AM, Ilya Khlopotov  wrote:
> > 
> > Hi Michael,
> > 
> >> For example, hears a crazy thought:
> >> Map every distinct occurence of a key/value instance through a crypto hash
> >> function to get a set of hashes.
> >> 
> >> These can be be precomputed by Couch without any lookups in FDB.  These
> >> will be spread all over kingdom come in FDB and not lend themselves to
> >> range search well.
> >> 
> >> So what you do is index them for frequency of occurring in the same set.
> >> In essence, you 'bucket them' statistically, and that bucket id becomes a
> >> key prefix. A crypto hash value can be copied into more than one bucket.
> >> The {bucket_id}/{cryptohash} becomes a {val_id}
> > 
> >> When writing a document, Couch submits the list/array of cryptohash values
> >> it computed to FDB and gets back the corresponding  {val_id} (the id with
> >> the bucket prefixed).  This can get somewhat expensive if there's always a
> >> lot of app local cache misses.
> >> 
> >> A document's value is then a series of {val_id} arrays up to 100k per
> >> segment.
> >> 
> >> When retrieving a document, you get the val_ids, find the distinct buckets
> >> and min/max entries for this doc, and then parallel query each bucket while
> >> reconstructing the document.
> > 
> > Interesting idea. Let's try to think it through to see if we can make it 
> > viable. 
> > Let's go through hypothetical example. Input data for the example:
> > - 1M of documents
> > - each document is around 10Kb
> > - each document consists of 1K of unique JSON paths 
> > - each document has 100 unique JSON field names
> > - every scalar value is 100 bytes
> > - 10% of unique JSON paths for every document already stored in database 
> > under different doc or different revision of the current one
> > - we assume 3 independent copies for every key-value pair in FDB
> > - our hash key size is 32 bytes
> > - let's assume we can determine if key is already on the storage without 
> > doing query
> > - 1% of paths is in cache (unrealistic value, in real live the percentage 
> > is lower)
> > - every JSON field name is 20 bytes
> > - every JSON path is 10 levels deep
> > - document key prefix length is 50
> > - every document has 10 revisions
> > Let's estimate the storage requirements and size of data we need to 
> > transmit. The calculations are not exact.
> > 1. storage_size_per_document (we cannot estimate exact numbers since we 
> > don't know how FDB stores it)
> >  - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 
> > * 3 = 1140 Kb (11x)
> > 2. number of independent keys to retrieve on document read (non-range 
> > queries) per document
> >  - 1K - (1K * 1%) = 990
> > 3. number of range queries: 0
> > 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 
> > 102 Kb (10x) 
> > 5. read latency (we use 2ms per read based on numbers from 
> > https://apple.github.io/foundationdb/performance.html)
> >- sequential: 990*2ms = 1980ms 
> >- range: 0
> > Let's compare these numbers with initial proposal (flattened JSON docs 
> > without global schema and without cache)
> > 1. storage_size_per_document
> >  - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
> >  - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes 
> >  - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 
> > 1976 Kb * 3 = 5930 Kb (59.3x)
> > 2. number of independent keys to retrieve: 0-2 (depending on index 
> > structure)
> > 3. number of range queries: 1 (1001 of keys in result)
> > 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)  
> > 5. read latency (we use 2ms per read based on numbers from 
> > https://apple.github.io/foundationdb/performance.html and estimate range 
> > read performance based on numbers from 
> > https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
> >  - range read performance: Given read performance is about 305,000 
> > reads/second and range performance 3,600,000 keys/second we estimate range 
> > 

Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

2019-02-04 Thread Adam Kocoloski
Ugh! We definitely cannot have a model where a 10K JSON document is exploded 
into 2MB worth of KV data. I’ve tried several times to follow the math here but 
I’m failing. I can’t even get past this first bit:

> - each document is around 10Kb
> - each document consists of 1K of unique JSON paths 
> - each document has 100 unique JSON field names
> - every scalar value is 100 bytes

If each document has 1000 paths, and each path (which leads to a unique scalar 
value, right?) has a value of 100 bytes associated with it … how is the 
document 10KB? Wouldn’t it need to be at least 100KB just by adding up all the 
scalar values?

Adam

> On Feb 4, 2019, at 6:08 AM, Ilya Khlopotov  wrote:
> 
> Hi Michael,
> 
>> For example, hears a crazy thought:
>> Map every distinct occurence of a key/value instance through a crypto hash
>> function to get a set of hashes.
>> 
>> These can be be precomputed by Couch without any lookups in FDB.  These
>> will be spread all over kingdom come in FDB and not lend themselves to
>> range search well.
>> 
>> So what you do is index them for frequency of occurring in the same set.
>> In essence, you 'bucket them' statistically, and that bucket id becomes a
>> key prefix. A crypto hash value can be copied into more than one bucket.
>> The {bucket_id}/{cryptohash} becomes a {val_id}
> 
>> When writing a document, Couch submits the list/array of cryptohash values
>> it computed to FDB and gets back the corresponding  {val_id} (the id with
>> the bucket prefixed).  This can get somewhat expensive if there's always a
>> lot of app local cache misses.
>> 
>> A document's value is then a series of {val_id} arrays up to 100k per
>> segment.
>> 
>> When retrieving a document, you get the val_ids, find the distinct buckets
>> and min/max entries for this doc, and then parallel query each bucket while
>> reconstructing the document.
> 
> Interesting idea. Let's try to think it through to see if we can make it 
> viable. 
> Let's go through hypothetical example. Input data for the example:
> - 1M of documents
> - each document is around 10Kb
> - each document consists of 1K of unique JSON paths 
> - each document has 100 unique JSON field names
> - every scalar value is 100 bytes
> - 10% of unique JSON paths for every document already stored in database 
> under different doc or different revision of the current one
> - we assume 3 independent copies for every key-value pair in FDB
> - our hash key size is 32 bytes
> - let's assume we can determine if key is already on the storage without 
> doing query
> - 1% of paths is in cache (unrealistic value, in real live the percentage is 
> lower)
> - every JSON field name is 20 bytes
> - every JSON path is 10 levels deep
> - document key prefix length is 50
> - every document has 10 revisions
> Let's estimate the storage requirements and size of data we need to transmit. 
> The calculations are not exact.
> 1. storage_size_per_document (we cannot estimate exact numbers since we don't 
> know how FDB stores it)
>  - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 
> 3 = 1140 Kb (11x)
> 2. number of independent keys to retrieve on document read (non-range 
> queries) per document
>  - 1K - (1K * 1%) = 990
> 3. number of range queries: 0
> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 
> Kb (10x) 
> 5. read latency (we use 2ms per read based on numbers from 
> https://apple.github.io/foundationdb/performance.html)
>- sequential: 990*2ms = 1980ms 
>- range: 0
> Let's compare these numbers with initial proposal (flattened JSON docs 
> without global schema and without cache)
> 1. storage_size_per_document
>  - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>  - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes 
>  - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 
> Kb * 3 = 5930 Kb (59.3x)
> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> 3. number of range queries: 1 (1001 of keys in result)
> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)  
> 5. read latency (we use 2ms per read based on numbers from 
> https://apple.github.io/foundationdb/performance.html and estimate range read 
> performance based on numbers from 
> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>  - range read performance: Given read performance is about 305,000 
> reads/second and range performance 3,600,000 keys/second we estimate range 
> performance to be 11.8x compared to read performance. If read performance is 
> 2ms than range performance is 0.169ms (which is hard to believe).
>  - sequential: 2 * 2 = 4ms
>  - range: 0.169
> 
> It looks like we are dealing with a tradeoff:
> - Map every distinct occurrence of a key/value instance through a crypto hash:
>  - 5.39x more disk space efficient
>  - 474x slower
> - flattened JSON model
>  - 5.39x less efficient in disk space
>  - 474x 

Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

2019-02-04 Thread Robert Newson
Hi,

The talk of crypto in the key space is extremely premature in my opinion. It is 
the database's job (FoundationDB's in this case) to map meaningful names to 
whatever it takes to efficiently store, index, and retrieve them. Obscuring 
every key with an expensive cryptographic operation works against everything I 
think distinguishes good software.

Keep it simple. The overhead of using readable, meaningful keys can be 
mitigated to a degree with a) the Directory layer, which shortens prefixes at 
the cost of a network round trip, and b) prefix elision in the fdb storage 
system itself (Redwood, which may land before we've completed our work). 

Actual measurements take priority over the speculation in this thread so far, 
and some overhead (defined as the actual storage of a document versus its 
theoretical minimum disk occupancy) is preferable to complicated, "clever", but 
brittle solutions.

I point to my earlier comment on optional document schemas which would reduce 
the length of keys to a scalar value anyway (the offset of the data item within 
the declared schema).

B.

-- 
  Robert Samuel Newson
  rnew...@apache.org

On Mon, 4 Feb 2019, at 11:08, Ilya Khlopotov wrote:
> Hi Michael,
> 
> > For example, hears a crazy thought:
> > Map every distinct occurence of a key/value instance through a crypto hash
> > function to get a set of hashes.
> >
> > These can be be precomputed by Couch without any lookups in FDB.  These
> > will be spread all over kingdom come in FDB and not lend themselves to
> > range search well.
> > 
> > So what you do is index them for frequency of occurring in the same set.
> > In essence, you 'bucket them' statistically, and that bucket id becomes a
> > key prefix. A crypto hash value can be copied into more than one bucket.
> > The {bucket_id}/{cryptohash} becomes a {val_id}
> 
> > When writing a document, Couch submits the list/array of cryptohash values
> > it computed to FDB and gets back the corresponding  {val_id} (the id with
> > the bucket prefixed).  This can get somewhat expensive if there's always a
> > lot of app local cache misses.
> >
> > A document's value is then a series of {val_id} arrays up to 100k per
> > segment.
> > 
> > When retrieving a document, you get the val_ids, find the distinct buckets
> > and min/max entries for this doc, and then parallel query each bucket while
> > reconstructing the document.
> 
> Interesting idea. Let's try to think it through to see if we can make it 
> viable. 
> Let's go through hypothetical example. Input data for the example:
> - 1M of documents
> - each document is around 10Kb
> - each document consists of 1K of unique JSON paths 
> - each document has 100 unique JSON field names
> - every scalar value is 100 bytes
> - 10% of unique JSON paths for every document already stored in database 
> under different doc or different revision of the current one
> - we assume 3 independent copies for every key-value pair in FDB
> - our hash key size is 32 bytes
> - let's assume we can determine if key is already on the storage without 
> doing query
> - 1% of paths is in cache (unrealistic value, in real live the 
> percentage is lower)
> - every JSON field name is 20 bytes
> - every JSON path is 10 levels deep
> - document key prefix length is 50
> - every document has 10 revisions
> Let's estimate the storage requirements and size of data we need to 
> transmit. The calculations are not exact.
> 1. storage_size_per_document (we cannot estimate exact numbers since we 
> don't know how FDB stores it)
>   - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 
> 10 * 3 = 1140 Kb (11x)
> 2. number of independent keys to retrieve on document read (non-range 
> queries) per document
>   - 1K - (1K * 1%) = 990
> 3. number of range queries: 0
> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 
> 102 Kb (10x) 
> 5. read latency (we use 2ms per read based on numbers from 
> https://apple.github.io/foundationdb/performance.html)
> - sequential: 990*2ms = 1980ms 
> - range: 0
> Let's compare these numbers with initial proposal (flattened JSON docs 
> without global schema and without cache)
> 1. storage_size_per_document
>   - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>   - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes 
>   - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 
> 1976 Kb * 3 = 5930 Kb (59.3x)
> 2. number of independent keys to retrieve: 0-2 (depending on index 
> structure)
> 3. number of range queries: 1 (1001 of keys in result)
> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)  
> 5. read latency (we use 2ms per read based on numbers from 
> https://apple.github.io/foundationdb/performance.html and estimate range 
> read performance based on numbers from 
> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>   - range read performance: Given read performance is about 305,000 
> 

Re: [DISCUSS] : things we need to solve/decide : storing JSON documents

2019-02-04 Thread Ilya Khlopotov
Hi Michael,

> For example, hears a crazy thought:
> Map every distinct occurence of a key/value instance through a crypto hash
> function to get a set of hashes.
>
> These can be be precomputed by Couch without any lookups in FDB.  These
> will be spread all over kingdom come in FDB and not lend themselves to
> range search well.
> 
> So what you do is index them for frequency of occurring in the same set.
> In essence, you 'bucket them' statistically, and that bucket id becomes a
> key prefix. A crypto hash value can be copied into more than one bucket.
> The {bucket_id}/{cryptohash} becomes a {val_id}

> When writing a document, Couch submits the list/array of cryptohash values
> it computed to FDB and gets back the corresponding  {val_id} (the id with
> the bucket prefixed).  This can get somewhat expensive if there's always a
> lot of app local cache misses.
>
> A document's value is then a series of {val_id} arrays up to 100k per
> segment.
> 
> When retrieving a document, you get the val_ids, find the distinct buckets
> and min/max entries for this doc, and then parallel query each bucket while
> reconstructing the document.

Interesting idea. Let's try to think it through to see if we can make it 
viable. 
Let's go through a hypothetical example. Input data for the example:
- 1M documents
- each document is around 10Kb
- each document consists of 1K unique JSON paths 
- each document has 100 unique JSON field names
- every scalar value is 100 bytes
- 10% of the unique JSON paths for every document are already stored in the 
database under a different doc or a different revision of the current one
- we assume 3 independent copies of every key-value pair in FDB
- our hash key size is 32 bytes
- let's assume we can determine whether a key is already in storage without 
doing a query
- 1% of paths are in cache (an unrealistic value; in real life the percentage 
is lower)
- every JSON field name is 20 bytes
- every JSON path is 10 levels deep
- the document key prefix length is 50
- every document has 10 revisions
Let's estimate the storage requirements and size of data we need to transmit. 
The calculations are not exact.
1. storage_size_per_document (we cannot estimate exact numbers since we don't 
know how FDB stores it)
  - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 3 
= 1140 Kb (11x)
2. number of independent keys to retrieve on document read (non-range queries) 
per document
  - 1K - (1K * 1%) = 990
3. number of range queries: 0
4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 Kb 
(10x) 
5. read latency (we use 2ms per read based on numbers from 
https://apple.github.io/foundationdb/performance.html)
- sequential: 990*2ms = 1980ms 
- range: 0
Let's compare these numbers with initial proposal (flattened JSON docs without 
global schema and without cache)
1. storage_size_per_document
  - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
  - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes 
  - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 
Kb * 3 = 5930 Kb (59.3x)
2. number of independent keys to retrieve: 0-2 (depending on index structure)
3. number of range queries: 1 (1001 of keys in result)
4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)  
5. read latency (we use 2ms per read based on numbers from 
https://apple.github.io/foundationdb/performance.html and estimate range read 
performance based on numbers from 
https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
  - range read performance: Given read performance is about 305,000 
reads/second and range performance 3,600,000 keys/second we estimate range 
performance to be 11.8x compared to read performance. If read performance is 
2ms then range performance is 0.169ms (which is hard to believe).
  - sequential: 2 * 2 = 4ms
  - range: 0.169

It looks like we are dealing with a tradeoff:
- Map every distinct occurrence of a key/value instance through a crypto hash:
  - 5.39x more disk space efficient
  - 474x slower
- flattened JSON model
  - 5.39x less efficient in disk space
  - 474x faster

In any case this unscientific exercise was very helpful, since it uncovered the 
high cost in terms of disk space. 59.3x the original disk size is too much IMO. 

Are there any ways we can make Michael's model more performant?

Also, I don't quite understand a few aspects of the global hash table proposal:

1. > - Map every distinct occurence of a key/value instance through a crypto 
hash function to get a set of hashes.
I think we are talking only about scalar values here? I.e. `"#/foo.bar.baz": 
123`, since I don't know how we can make it work for all possible JSON paths of 
`{"foo": {"bar": {"size": 12, "baz": 123}}}`:
- foo
- foo.bar
- foo.bar.baz

2. how to delete documents

Best regards,
ILYA


On 2019/01/30 23:33:22, Michael Fair  wrote: 
> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski  
> > Hi Michael,
> >
> > > The trivial