Re: [PERFORM] Storing large documents - one table or partition by doc?

2016-09-23 Thread Jim Nasby

On 9/23/16 7:14 AM, Mike Sofen wrote:

So with proper indexing, I can’t see where there will be a performance
issue.


Table bloat could become problematic. If there is a pattern where you 
can predict which documents are likely to be active (say, documents that 
have been modified in the last 10 days), then you can keep all of those 
in a set of tables that is fairly small, and keep the remaining 
documents in a set of "archive" tables. That will help reduce bloat in 
the large archive tables. Before putting in that extra work though, I'd 
just try the simple solution and see how well it works.
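
As a rough sketch of that active/archive split (every table and column name here is hypothetical, and both tables are assumed to share the same column list):

-- Move documents untouched for ~10 days from the small, frequently
-- vacuumed "active" table into the large, mostly static "archive" table.
WITH moved AS (
    DELETE FROM docs_active
    WHERE last_modified < now() - interval '10 days'
    RETURNING *
)
INSERT INTO docs_archive SELECT * FROM moved;

-- Reads can hide the split behind a view:
-- CREATE VIEW docs AS
--   SELECT * FROM docs_active UNION ALL SELECT * FROM docs_archive;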

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [PERFORM] Strange nested loop for an INSERT

2016-09-23 Thread Jim Nasby

On 9/23/16 12:59 PM, phb07 wrote:


On 21/09/2016 at 23:42, Jim Nasby wrote:

On 9/12/16 1:05 PM, phb07 wrote:

The drawback is the overhead of this added ANALYZE statement. With heavy 
processing like in this test case, it is worth doing; but for common 
cases, it's a little expensive.


You could always look at the number of rows affected by a command and
make a decision on whether to ANALYZE based on that, possibly by
looking at pg_stat_all_tables.n_mod_since_analyze.
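
A minimal example of that check (the table name is just a placeholder):

-- n_mod_since_analyze = rows inserted/updated/deleted since the table
-- was last analyzed (manually or by autovacuum).
SELECT n_mod_since_analyze
FROM pg_stat_all_tables
WHERE schemaname = 'public'
  AND relname = 'my_table';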

I have solved the issue by adding an ANALYZE between the two statements. To 
avoid the associated overhead in cases where it is not worthwhile, the 
ANALYZE is only performed when more than 1000 rows have just been 
deleted by the first statement (as the logic is embedded in a plpgsql 
function, the GET DIAGNOSTICS statement provides that row count). This 
threshold is approximately the point where the potential loss due to bad 
estimates equals the ANALYZE cost.
But the idea of using the n_mod_since_analyze data to also take into 
account other recent updates not yet reflected in the statistics is 
very interesting.


Another interesting possibility would be to look at 
pg_catalog.pg_stat_xact_all_tables; if you add n_tup_ins, _upd, and _del, 
that will tell you how much n_mod_since_analyze will increase when 
your transaction commits, so you could gauge exactly how much the 
current transaction has changed things.
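
Something along these lines (the table name is just a placeholder):

-- pg_stat_xact_all_tables reports activity of the current transaction
-- that has not yet been folded into the cumulative statistics.
SELECT n_tup_ins + n_tup_upd + n_tup_del AS mods_in_this_xact
FROM pg_stat_xact_all_tables
WHERE schemaname = 'public'
  AND relname = 'doc_rows';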

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [PERFORM] Strange nested loop for an INSERT

2016-09-23 Thread phb07


On 21/09/2016 at 23:42, Jim Nasby wrote:

On 9/12/16 1:05 PM, phb07 wrote:

The drawback is the overhead of this added ANALYZE statement. With heavy 
processing like in this test case, it is worth doing; but for common 
cases, it's a little expensive.


You could always look at the number of rows affected by a command and 
make a decision on whether to ANALYZE based on that, possibly by 
looking at pg_stat_all_tables.n_mod_since_analyze.
I have solved the issue by adding an ANALYZE between the two statements. To 
avoid the associated overhead in cases where it is not worthwhile, the 
ANALYZE is only performed when more than 1000 rows have just been 
deleted by the first statement (as the logic is embedded in a plpgsql 
function, the GET DIAGNOSTICS statement provides that row count). This 
threshold is approximately the point where the potential loss due to bad 
estimates equals the ANALYZE cost.
But the idea of using the n_mod_since_analyze data to also take into 
account other recent updates not yet reflected in the statistics is 
very interesting.
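
For illustration, a minimal sketch of that pattern in plpgsql; the function, 
table names, and columns are invented here, only the 1000-row threshold comes 
from the description above:

CREATE OR REPLACE FUNCTION process_doc(p_doc_id bigint)
RETURNS void
LANGUAGE plpgsql AS
$$
DECLARE
    v_deleted bigint;
BEGIN
    -- First statement: the potentially large DELETE.
    DELETE FROM doc_rows WHERE doc_id = p_doc_id;
    GET DIAGNOSTICS v_deleted = ROW_COUNT;

    -- Only pay the ANALYZE cost when enough rows changed to skew the
    -- planner's estimates; ~1000 rows is the break-even point above.
    IF v_deleted > 1000 THEN
        ANALYZE doc_rows;
    END IF;

    -- Second statement: the INSERT whose plan depends on fresh statistics.
    INSERT INTO doc_rows
        SELECT * FROM doc_rows_log WHERE doc_id = p_doc_id;
END;
$$;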


Thanks.




Re: [PERFORM] Storing large documents - one table or partition by doc?

2016-09-23 Thread Mike Sofen
From: Dev Nop  Sent: Friday, September 23, 2016 3:12 AM
I’m storing thousands of independent documents, each containing around 20k rows. 
The larger the document, the more likely it is to be active with inserts and 
updates (1000s/day). The most common read query is to get all the rows for a 
single document (100s/day). It will support real-time collaboration, but with 
strong consistency for a simple schema, so it is not well suited to dedicated 
"document databases" that assume a schema-less model and eventual consistency. 
I won’t have great hardware or budget, so I need to squeeze the most out of the least.

 

My question is whether to put all documents into a single huge table or 
partition by document?

 

The documents are independent, so it’s purely a performance question. It’s too 
many tables for PostgreSQL’s partitioning support, but I don’t get any benefit 
from a master table and constraints. Handling partitioning in application logic 
is effectively zero cost.

 

I know that 1000s of tables is regarded as an anti-pattern but I can only see 
the performance and maintenance benefits of one table per independent document 
e.g. fast per-table vacuum, incremental schema updates, easy future sharding. A 
monster table will require additional key columns and indexes that don’t have 
any value beyond allowing the documents to sit in the same table.

 

The only downside seems to be the system-level per-table overhead, but I only 
see that as a problem if I have a very long tail of tiny documents. I'd rather 
solve that problem if it occurs than manage an all-eggs-in-one-basket monster 
table.


Is there anything significant I am missing in my reasoning? Is it mostly a 
“relational purist” perspective that argues against multiple tables? Should I 
be looking at alternative tech for this problem?

 

The one factor I haven't fully resolved is how much a caching layer in front of 
the database changes things.

 

Thanks for your help.

-

This is, to me, a very standard, almost classic, relational pattern, and one 
that a relational engine handles extremely well, especially the consistency and 
locking needed to support lots of updates.  Inserts are irrelevant unless the 
parent record must be locked for each insert; that would be a bad design.

 

Imagine a normal parent-child table pair, 1:M, with the 20k rows per parent 
document in the child table.  Unless there’s something very bizarre about the 
access patterns against that child table, those 20k rows per document would not 
normally all be in play for every user on every access (it’s too much data to 
show on a web page, for instance).  Even so, at “100s” of large queries per day, 
it’s a trivial load unless each child row contains a large JSON blob, which 
doesn’t jibe with your table description.
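
For concreteness, a sketch of that parent-child pair (all names and types here 
are assumptions, not taken from the thread):

CREATE TABLE document (
    doc_id     bigserial PRIMARY KEY,
    title      text        NOT NULL,
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE document_row (
    doc_id  bigint  NOT NULL REFERENCES document (doc_id),
    row_num integer NOT NULL,
    payload jsonb,
    -- Leading doc_id keeps one document's ~20k rows together in the index,
    -- so "all rows for one document" is a single range scan.
    PRIMARY KEY (doc_id, row_num)
);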

 

So with proper indexing, I can’t see where there will be a performance issue.   
Worst case, you create a few partitions based on some category, but the row 
counts you’re describing don’t yet warrant it.  I’m running a few hundred 
million rows in a new “child” table on a dev server (4 cores/16 GB RAM) with 
large JSON documents in each row, and normal queries are still fast enough for 
web pages, using a paging model (say 20 full rows per web page request).  The 
critical pieces, hardware-wise, are memory (buy as much as you can afford) and 
SSDs (required, IMO).  It’s much harder to create measurable load on the 
CPUs.  Amazon has memory-optimized EC2 instances that support that pattern 
(with SSD storage).
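
As a purely illustrative example of that paging model, one way to page (keyset 
pagination) against the hypothetical schema sketched above:

-- Fetch the next 20 rows of document 42, continuing after the last
-- row_num already shown (here 200); served by the (doc_id, row_num) index.
SELECT doc_id, row_num, payload
FROM document_row
WHERE doc_id = 42
  AND row_num > 200
ORDER BY row_num
LIMIT 20;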

 

Are there other issues/requirements that are creating other performance 
concerns that aren’t obvious in your initial post?

 

Mike Sofen (Synthetic Genomics)



[PERFORM] Storing large documents - one table or partition by doc?

2016-09-23 Thread Dev Nop
I’m storing thousands of independent documents, each containing around 20k
rows. The larger the document, the more likely it is to be active with
inserts and updates (1000s/day). The most common read query is to get all
the rows for a single document (100s/day). It will support real-time
collaboration, but with strong consistency for a simple schema, so it is not
well suited to dedicated "document databases" that assume a schema-less model
and eventual consistency. I won’t have great hardware or budget, so I need to
squeeze the most out of the least.

My question is whether to put all documents into a single huge table or
partition by document?

The documents are independent, so it’s purely a performance question. It’s too
many tables for PostgreSQL’s partitioning support, but I don’t get any benefit
from a master table and constraints. Handling partitioning in application
logic is effectively zero cost.

I know that 1000s of tables is regarded as an anti-pattern but I can only
see the performance and maintenance benefits of one table per independent
document e.g. fast per-table vacuum, incremental schema updates, easy
future sharding. A monster table will require additional key columns and
indexes that don’t have any value beyond allowing the documents to sit in
the same table.

The only downside seems to be the system-level per-table overhead, but I
only see that as a problem if I have a very long tail of tiny documents.
I'd rather solve that problem if it occurs than manage an
all-eggs-in-one-basket monster table.

Is there anything significant I am missing in my reasoning? Is it mostly a
“relational purist” perspective that argues against multiple tables? Should
I be looking at alternative tech for this problem?

The one factor I haven't fully resolved is how much a caching layer in
front of the database changes things.

Thanks for your help.