Re: [GENERAL] EAV Designs for Multi-Tenant Applications

2017-10-15 Thread Alban Hertroys

> On 12 Oct 2017, at 16:48, Nic Pottier  wrote:

…

> We have thousands of organizations, some organizations have hundreds of 
> custom fields. Some organizations have millions of contacts. We have tens of 
> millions of values across organizations, so relatively sparse with a large 
> number of tiny / unused organizations. All these numbers will continue to 
> grow, though a single organization with more than 10 million contacts is 
> unlikely.
> 
> What is the recommended way of modeling this in order to allow performant 
> queries on the custom fields?
> 
> Right now we basically use an EAV model (a single contact_fields table) with 
> compound indexes that combine the field key and field value so as to force 
> locality in the index. That has worked ok, but occasionally falls on its face 
> when the query planner makes a wrong guess due to the (obviously skewed) 
> statistics it has to work with. Multi-field queries can also get painful, 
> especially on the largest organizations with millions of contacts.
> 
> What other approaches should we be looking at? We've brainstormed different 
> approaches but would love some wisdom to help us narrow down what are 
> reasonable things to try testing out. Most of our designs home in on creating 
> a table per organization to hold field values and dynamically creating 
> indexes on that. The idea being that we won't kill our insert performance as 
> much by only having one index per field to check on inserts and table 
> statistics should still be ok. (how are statistics managed for JSONB fields?) 
> The main question we have is what is going to happen if we have thousands (or 
> tens of thousands) of tables on a single database? The good news is the vast 
> majority of our organizations are idle at any point in time.

The table-per-customer approach is probably on the right track. It means you 
don't have to bother the query planner with fields that are specific to only a 
few organisations when they don't apply to the currently queried organisation.

I would go one step further and put all the fields common across all 
organisations into one master table, and inherit that in each organisation's 
specific table. If you add a check constraint on the inheriting table's 
organisation id, I expect that you can even make use of constraint exclusion. 
That basically turns the master table into a partitioned table, where each 
partition has its own fields. I'm not 100% certain that constraint exclusion 
works when partitions have different layouts, but I don't see why not, as 
long as the partitioning (check) constraint is on a field that is also in the 
master table.
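
A minimal sketch of what I mean (all table, column and field names here are 
made up):

    -- Master table with the fields common to all organisations:
    CREATE TABLE contact (
        contact_id bigserial PRIMARY KEY,
        org_id     int NOT NULL,
        name       text
    );

    -- One child table per organisation, inheriting the common columns and
    -- adding that organisation's custom fields. The CHECK constraint on
    -- org_id is what makes constraint exclusion possible:
    CREATE TABLE contact_org_42 (
        favourite_colour text,
        shoe_size        int,
        PRIMARY KEY (contact_id),
        CHECK (org_id = 42)
    ) INHERITS (contact);

With constraint_exclusion = partition (the default), a query like
SELECT * FROM contact WHERE org_id = 42 should then only have to scan the 
master table and contact_org_42, skipping the other organisations' children.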

> Approaches we've thought about and questions / pros / cons:
> 
> 1) Add a JSONB field on our (shared across organizations) contact table, 
> store field values there. Create JSONB indexes per unique field. 
> pros: nice having the locality of data on contact, multi-field queries are 
> likely way better, query performance should be good
> cons: we have to create thousands of indexes? we have to use UUIDs as keys 
> to keep our indexes org-specific? insert performance suffers from having 
> thousands of partial indexes (how badly)?
> 
> 2) Create a table per organization `contact_fields_[org_id]` containing a 
> column per field. Create columns and indexes per unique field.
> pros: locality is nice again, multi-field queries are better, query and 
> insert performance should be good.
> cons: thousands of tables with up to 100 indexes each; is that going to 
> blow up?

Perhaps not as much as you think. If an organisation has hundreds of fields, 
they are probably not using a normalised view of their data. Since you're 
using separate columns now, each column can have its own data type (arguably 
a pro rather than a con), and that adds the possibility of references to/from 
other tables.

For example, if an organisation created fields address1 to address10, you can 
instead have an address table that references the organisation's specific 
contact partition, allowing them to add as many addresses as they need. If more 
organisations use addresses (very likely), you can use the same partitioning 
approach for each organisation's address table and put a foreign key constraint 
on each to the organisation's contact partition.
That puts several of your original fields under a single index.
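
To illustrate, continuing with the made-up names from the earlier sketch:

    -- One row per address instead of address1..address10 columns. Note that
    -- the foreign key has to point at the organisation's own contact
    -- partition; a foreign key on the master table would not see rows in
    -- the children:
    CREATE TABLE address_org_42 (
        address_id bigserial PRIMARY KEY,
        contact_id bigint NOT NULL
            REFERENCES contact_org_42 (contact_id) ON DELETE CASCADE,
        address    text NOT NULL
    );
    CREATE INDEX ON address_org_42 (contact_id);

One index on contact_id now serves what used to be ten separate address 
fields.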

I seem to recall that's pretty much how ToroDB goes about organising 
unstructured data on top of PG, although they probably use a couple more tricks 
than just that.

In fact, you could already apply this to your current design, although you 
probably wouldn't gain as much from it.

> What other approaches should we be considering? I know EAV is a pain, 
> especially in multi-tenant situations, but we'd love to hear success (and 
> failure) stories from the community on how they've dealt with these.

I'm not that familiar with JSONB, but wouldn't it be possible to create an 
index over an 

[GENERAL] EAV Designs for Multi-Tenant Applications

2017-10-12 Thread Nic Pottier
Howdy all,

We maintain a hosted multi-tenant system for a large number of users. Each
user has what we call an "organization" or "org" which is their sandbox. In
that organization they can manage contacts and define custom fields on
those contacts, which then have values per contact.

We have thousands of organizations, some organizations have hundreds of
custom fields. Some organizations have millions of contacts. We have tens
of millions of values across organizations, so relatively sparse with a
large number of tiny / unused organizations. All these numbers will
continue to grow, though a single organization with more than 10 million
contacts is unlikely.

What is the recommended way of modeling this in order to allow performant
queries on the custom fields?

Right now we basically use an EAV model (a single contact_fields table)
with compound indexes that combine the field key and field value so as to
force locality in the index. That has worked ok, but occasionally falls on
its face when the query planner makes a wrong guess due to the (obviously
skewed) statistics it has to work with. Multi-field queries can also get
painful, especially on the largest organizations with millions of contacts.
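
Schematically, the current layout is something like this (simplified, names 
made up):

    CREATE TABLE contact_fields (
        org_id      int    NOT NULL,
        contact_id  bigint NOT NULL,
        field_key   text   NOT NULL,
        field_value text
    );

    -- Compound index over key and value, so that rows for the same field
    -- sort together and get locality in the index:
    CREATE INDEX contact_fields_key_value_idx
        ON contact_fields (field_key, field_value);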

What other approaches should we be looking at? We've brainstormed different
approaches but would love some wisdom to help us narrow down what are
reasonable things to try testing out. Most of our designs home in on
creating a table per organization to hold field values and dynamically
creating indexes on that. The idea being that we won't kill our insert
performance as much by only having one index per field to check on inserts
and table statistics should still be ok. (how are statistics managed for
JSONB fields?) The main question we have is what is going to happen if we
have thousands (or tens of thousands) of tables on a single database? The
good news is the vast majority of our organizations are idle at any point
in time.

Approaches we've thought about and questions / pros / cons (rough sketches of
options 1 and 3 follow the list):

1) Add a JSONB field on our (shared across organizations) contact table,
store field values there. Create JSONB indexes per unique field.
   pros: nice having the locality of data on contact, multi-field queries
are likely way better, query performance should be good
   cons: we have to create thousands of indexes? we have to use UUIDs as
keys to keep our indexes org-specific? insert performance suffers from
having thousands of partial indexes (how badly)?

2) Create a table per organization `contact_fields_[org_id]` containing a
column per field. Create columns and indexes per unique field.
   pros: locality is nice again, multi-field queries are better, query and
insert performance should be good.
   cons: thousands of tables with up to 100 indexes each; is that going to
blow up?

3) Create a table per organization `contacts_fields_[org_id]` which
contains a `fields` JSONB column. Create JSONB indexes per unique field.
  pros: locality is good, multi-field queries good, query performance
should be good. Adding and removing fields is a bit simpler than in case 2)
above, and naming can be a bit clearer than in 1), as we don't have to worry
about multi-org key name collisions.
  cons: same as 2) but with a JSONB flavor: hundreds of JSONB indexes across
thousands of tables, and the thousands of tables themselves.

4) Create a database per organization? Use 1) above.
  pros: all data is localized, might see performance improvements
elsewhere, query and insert performance should be good
  cons: mother of a refactor :)  What happens with thousands of databases
on a single box? Is this actually better from the perspective of getting to
offload currently inactive orgs?
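
Rough sketches of options 1 and 3, to make the index shapes concrete (the
table names, the org id 42 and the 'age' field are all made up):

    -- Option 1: shared contact table with a JSONB column, plus one partial
    -- expression index per (org, field) combination:
    ALTER TABLE contact ADD COLUMN fields jsonb;
    CREATE INDEX contact_org42_age_idx
        ON contact ((fields ->> 'age'))
        WHERE org_id = 42;

    -- Option 3: per-organization table holding a JSONB fields column, with
    -- one expression index per field:
    CREATE TABLE contacts_fields_42 (
        contact_id bigint PRIMARY KEY,
        fields     jsonb
    );
    CREATE INDEX contacts_fields_42_age_idx
        ON contacts_fields_42 ((fields ->> 'age'));

    -- A field query against option 3 then looks like:
    SELECT contact_id
    FROM contacts_fields_42
    WHERE fields ->> 'age' = '35';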

What other approaches should we be considering? I know EAV is a pain,
especially in multi-tenant situations, but we'd love to hear success (and
failure) stories from the community on how they've dealt with these.

Cheers,

-Nic