Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Jeff Janes
On Thu, Sep 29, 2016 at 11:12 AM, Pavel Stehule wrote:

>
>
> 2016-09-29 14:20 GMT+02:00 Sven R. Kunze :
>
>> On 23.09.2016 11:00, Pavel Stehule wrote:
>>
>> 2016-09-23 8:35 GMT+02:00 Sven R. Kunze :
>>
>>> I was wondering: would it be possible for PostgreSQL to rewrite the
>> query to generate the UNION (or subquery plan if it's also fast) on its
>>> own?
>>>
>>
>> It depends on the real data. On your specific data the UNION variant is
>> pretty fast; on a different data set, the UNION can be pretty slow. This is
>> related to the difficulty of estimating OR predicates.
>>
>>
>> I figure that the UNION is fast if the sub-results are small (which they
>> are in our case). On the contrary, when they are huge, the OUTER JOIN
>> variant might be preferable.
>>
>>
>> Is there something I can do to help here?
>>
>> Or do you think it's naturally application-dependent and thus should be
>> solved with application logic just as we did?
>>
>
> In an ideal world the plan should be independent of the form used. The most
> difficult part is safe estimation of OR predicates. With correct estimation,
> the transformation to the UNION form should not be necessary, I think.
>

I don't think it is an estimation issue.  If it were, the planner would
always choose the same inefficient plan (provided the join collapse
limits, etc. don't come into play, which I don't think they do here) for
all the different ways of writing the query.

Since that is not happening, the planner must not be able to prove that the
different queries are semantically identical to each other, which means
that it can't pick the other plan no matter how good the estimates look.
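
For reference, here is a minimal sketch of the two query shapes under
discussion (big_table, table_a and table_b follow the thread; the exact join
shape is an assumption, since the original query is not shown):

-- OR form: a single plan whose WHERE spans two joined tables
SELECT b.*
FROM big_table b
LEFT JOIN table_a a ON a.big_table_id = b.id
LEFT JOIN table_b t ON t.big_table_id = b.id
WHERE a.item_id IN (...) OR t.item_id IN (...);

-- UNION rewrite: two independently planned branches
SELECT b.* FROM big_table b
JOIN table_a a ON a.big_table_id = b.id WHERE a.item_id IN (...)
UNION
SELECT b.* FROM big_table b
JOIN table_b t ON t.big_table_id = b.id WHERE t.item_id IN (...);

Note that the two forms are not trivially identical: the join form can emit a
big_table row once per matching join partner, while UNION deduplicates.  That
is exactly the kind of semantic gap the planner would have to reason about.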

Cheers,

Jeff


Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Jeff Janes
On Thu, Sep 29, 2016 at 11:48 AM, Sven R. Kunze  wrote:

> On 29.09.2016 20:03, Jeff Janes wrote:
>
> Perhaps some future version of PostgreSQL could do so, but my gut feeling
> is that that is not very likely.  It would take a lot of work, would risk
> breaking or slowing down other things, and is probably too much of a niche
> issue to attract a lot of interest.
>
>
> I hope not; in business and reports/stats applications there is a lot
> of room for this.
>
> Why do you think that OR-ing several tables is a niche issue? I can at
> least name 3 different projects (from 3 different domains) where combining
> 3 or more tables with OR is relevant and should be reasonably fast.
>

Well, I don't recall seeing this issue on this list before (or a few other
forums I read) while I see several other issues over and over again.  So
that is why I think it is a niche issue.  Perhaps I have seen it before
and just forgotten, or have not recognized it as being the same issue each
time.



> This multitude of solutions also shows that application developers might
> be overwhelmed by choosing the most appropriate AND most long-lasting one.
> Because what I take from the discussion is that a UNION might be
> appropriate right now but that could change in the future even for the very
> same use-case at hand.
>

I'm not sure what would cause it to change.  Do you mean if you suddenly
start selecting a much larger portion of the table?  I don't know that the
union would be particularly bad in that case, either.

I'm not saying it wouldn't be nice to fix it.  I just don't think it is
particularly likely to happen soon.  I could be wrong (especially if you
can write the code to make it happen).

Cheers,

Jeff


Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Pavel Stehule
2016-09-29 20:49 GMT+02:00 Sven R. Kunze :

> On 29.09.2016 20:12, Pavel Stehule wrote:
>
>> In an ideal world the plan should be independent of the form used. The most
>> difficult part is safe estimation of OR predicates. With correct estimation,
>> the transformation to the UNION form should not be necessary, I think.
>>
>
> Ah, okay. That's interesting.
>
> So how can I help here?
>

Try to write a patch :) Or better, help with enhancing PostgreSQL's
estimation model. Tomas Vondra has been working on multicolumn statistics
for two years. He needs help with reviews.

Regards

Pavel

>
> Regards,
> Sven
>


Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Sven R. Kunze

On 29.09.2016 20:12, Pavel Stehule wrote:
In an ideal world the plan should be independent of the form used. The most 
difficult part is safe estimation of OR predicates. With correct estimation, 
the transformation to the UNION form should not be necessary, I think.


Ah, okay. That's interesting.

So how can I help here?

Regards,
Sven




Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Sven R. Kunze

Hi Jeff,

On 29.09.2016 20:03, Jeff Janes wrote:
I don't know what the subquery plan is; I don't see references to that 
in the email chain.


Lutz posted the following solution:

SELECT * FROM big_table
WHERE
 id IN (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (...))
 OR
 id IN (SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (...))


I don't believe that current versions of PostgreSQL are capable of 
rewriting the plan in the style of a union.  It is not just a matter 
of tweaking the cost estimates; it simply never considers such a plan 
in the first place given the text of your query.


That's okay and that's why I am asking here. :)

Perhaps some future version of PostgreSQL could do so, but my gut 
feeling is that that is not very likely.  It would take a lot of work, 
would risk breaking or slowing down other things, and is probably too 
much of a niche issue to attract a lot of interest.


I hope not; in business and reports/stats applications there is a 
lot of room for this.


Why do you think that OR-ing several tables is a niche issue? I can at 
least name 3 different projects (from 3 different domains) where 
combining 3 or more tables with OR is relevant and should be reasonably 
fast.


Most domains that could benefit would probably have star-like schemas. 
So, big_table corresponds to the center of the star, whereas the rays 
correspond to various (even dynamic) extensions to the base data structure.



Why not just use the union?


Sure that would work in this particular case. However, this thread 
actually sought a general answer to "how to OR more than two tables".


Are you using a framework which generates the query automatically and 
you have no control over it?


We use a framework and we can use the UNION if we want to.


Or do you just think it is ugly or fragile for some other reason?


I don't think it's ugly or fragile. I am just used to the fact that **if 
it's equivalent** then PostgreSQL can figure it out (without constant 
supervision from application developers).


So, it's just a matter of inconvenience. ;)

Perhaps moving the union from the outside to the inside would be more 
suitable?  That way the select list is only specified once, and if you 
AND more clauses into the WHERE condition they also only need to be 
specified once.


SELECT * FROM big_table
WHERE
 id IN (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (...)
        UNION
        SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (...)
 );


Yet another solution I guess, so thanks a lot. :)

This multitude of solutions also shows that application developers might 
be overwhelmed by choosing the most appropriate AND most long-lasting 
one. Because what I take from the discussion is that a UNION might be 
appropriate right now but that could change in the future even for the 
very same use-case at hand.


Cheers,
Sven


Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Pavel Stehule
2016-09-29 14:20 GMT+02:00 Sven R. Kunze :

> On 23.09.2016 11:00, Pavel Stehule wrote:
>
> 2016-09-23 8:35 GMT+02:00 Sven R. Kunze :
>
>> I was wondering: would it be possible for PostgreSQL to rewrite the query
>> to generate the UNION (or subquery plan if it's also fast) on its own?
>>
>
> It depends on the real data. On your specific data the UNION variant is pretty
> fast; on a different data set, the UNION can be pretty slow. This is related to
> the difficulty of estimating OR predicates.
>
>
> I figure that the UNION is fast if the sub-results are small (which they
> are in our case). On the contrary, when they are huge, the OUTER JOIN
> variant might be preferable.
>
>
> Is there something I can do to help here?
>
> Or do you think it's naturally application-dependent and thus should be
> solved with application logic just as we did?
>

In an ideal world the plan should be independent of the form used. The most
difficult part is safe estimation of OR predicates. With correct estimation,
the transformation to the UNION form should not be necessary, I think.
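
For context, the planner's default estimate for an OR of two predicates
assumes they are independent; a sketch of the arithmetic:

    selectivity(A OR B) = s(A) + s(B) - s(A) * s(B)
    e.g. with s(A) = 0.01 and s(B) = 0.02:
    0.01 + 0.02 - 0.01 * 0.02 = 0.0298

When the predicates are correlated, the true fraction of matching rows can be
far from this figure, which is what makes a safe estimate hard.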

Regards

Pavel


>
> Cheers,
> Sven
>


Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Jeff Janes
On Thu, Sep 22, 2016 at 11:35 PM, Sven R. Kunze  wrote:

> Thanks a lot Madusudanan, Igor, Lutz and Jeff for your suggestions.
>
> What I can confirm is that the UNION idea runs extremely fast (don't have
> access to the db right now to test the subquery idea, but will check next
> week as I travel right now). Thanks again! :)
>
>
> I was wondering: would it be possible for PostgreSQL to rewrite the query
> to generate the UNION (or subquery plan if it's also fast) on its own?
>

I don't know what the subquery plan is; I don't see references to that in
the email chain.

I don't believe that current versions of PostgreSQL are capable of
rewriting the plan in the style of a union.  It is not just a matter of
tweaking the cost estimates; it simply never considers such a plan in the
first place given the text of your query.

Perhaps some future version of PostgreSQL could do so, but my gut feeling
is that that is not very likely.  It would take a lot of work, would risk
breaking or slowing down other things, and is probably too much of a niche
issue to attract a lot of interest.

Why not just use the union?  Are you using a framework which generates the
query automatically and you have no control over it?  Or do you just think
it is ugly or fragile for some other reason?

Perhaps moving the union from the outside to the inside would be more
suitable?  That way the select list is only specified once, and if you AND
more clauses into the WHERE condition they also only need to be specified
once.

SELECT * FROM big_table
WHERE
 id IN (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (...)
        UNION
        SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (...)
  );


Cheers,

Jeff


Re: [PERFORM] Failing Multi-Job Restores, Missing Indexes on Restore

2016-09-29 Thread Tom Lane
Cea Stapleton  writes:
> We’re using pg_restore (PostgreSQL) 9.5.4 for the restores. We’ve used 
> variations on the job number:
> /usr/bin/pg_restore -j 6 -Fc -O -c -d DBNAME RESTORE_FILE

OK ... do you actually need the -c, and if so why?
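
A sketch of the alternative, assuming the target database can simply be
dropped and recreated between restores (database and file names follow the
thread):

dropdb DBNAME
createdb DBNAME
/usr/bin/pg_restore -j 6 -Fc -O -d DBNAME RESTORE_FILE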

> We’ll take a look at the memory overcommit - would that also explain the 
> index issues we were seeing before we were seeing the crashes?

Unlikely.  I'm guessing that there's some sort of race condition involved
in parallel restore with -c, but it's not very clear what.

regards, tom lane




Re: [PERFORM] Failing Multi-Job Restores, Missing Indexes on Restore

2016-09-29 Thread Cea Stapleton
Thanks Tom!

We’re using pg_restore (PostgreSQL) 9.5.4 for the restores. We’ve used 
variations on the job number:

/usr/bin/pg_restore -j 6 -Fc -O -c -d DBNAME RESTORE_FILE

We’ll take a look at the memory overcommit - would that also explain the index 
issues we were seeing before we were seeing the crashes?

Cea Stapleton 
Operations Engineer
http://www.healthfinch.com


> On Sep 29, 2016, at 7:52 AM, Tom Lane  wrote:
> 
> Cea Stapleton  writes:
>> We are having a baffling problem we hope you might be able to help with. We 
>> were hoping to speed up postgres restores to our reporting server. First, we 
>> were seeing missing indexes with pg_restore to our reporting server for one 
>> of our databases when we did pg_restore with multiple jobs (a clean restore, 
>> we also tried dropping the database prior to restore, just in case something 
>> was extant and amiss). The indexes missed were not consistent, and we were 
>> only ever seeing errors on import that indicated an index had not yet been 
>> built. For example:
> 
>> pg_restore: [archiver (db)] could not execute query: ERROR:  index 
>> "index_versions_on_item_type_and_item_id" does not exist
>>   Command was: DROP INDEX public.index_versions_on_item_type_and_item_id;
> 
> Which PG version is that; particularly, which pg_restore version?
> What's the exact pg_restore command you were issuing?
> 
>> We decided to move back to a multi-job regular restore, and then the 
>> restores began crashing thusly:
>> [2016-09-14 02:20:36 UTC]LOG:  server process (PID 27624) was terminated 
>> by signal 9: Killed
> 
> This is probably the dreaded Linux OOM killer.  Fix by reconfiguring your
> system to disallow memory overcommit, or at least make it not apply to
> Postgres, cf
> https://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
> 
>   regards, tom lane





Re: [PERFORM] Failing Multi-Job Restores, Missing Indexes on Restore

2016-09-29 Thread Tom Lane
Cea Stapleton  writes:
> We are having a baffling problem we hope you might be able to help with. We 
> were hoping to speed up postgres restores to our reporting server. First, we 
> were seeing missing indexes with pg_restore to our reporting server for one 
> of our databases when we did pg_restore with multiple jobs (a clean restore, 
> we also tried dropping the database prior to restore, just in case something 
> was extant and amiss). The indexes missed were not consistent, and we were 
> only ever seeing errors on import that indicated an index had not yet been 
> built. For example:

> pg_restore: [archiver (db)] could not execute query: ERROR:  index 
> "index_versions_on_item_type_and_item_id" does not exist
>Command was: DROP INDEX public.index_versions_on_item_type_and_item_id;

Which PG version is that; particularly, which pg_restore version?
What's the exact pg_restore command you were issuing?

> We decided to move back to a multi-job regular restore, and then the restores 
> began crashing thusly:
> [2016-09-14 02:20:36 UTC]LOG:  server process (PID 27624) was terminated 
> by signal 9: Killed

This is probably the dreaded Linux OOM killer.  Fix by reconfiguring your
system to disallow memory overcommit, or at least make it not apply to
Postgres, cf
https://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
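
A minimal sketch of what that page recommends, assuming a Linux host with
root access (vm.overcommit_ratio may also need adjusting):

# /etc/sysctl.conf -- disallow memory overcommit
vm.overcommit_memory = 2

# apply without a reboot
sysctl -p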

regards, tom lane




[PERFORM] Failing Multi-Job Restores, Missing Indexes on Restore

2016-09-29 Thread Cea Stapleton
Hi!

We are having a baffling problem we hope you might be able to help with. We 
were hoping to speed up postgres restores to our reporting server. First, we 
were seeing missing indexes with pg_restore to our reporting server for one of 
our databases when we did pg_restore with multiple jobs (a clean restore, we 
also tried dropping the database prior to restore, just in case something was 
extant and amiss). The indexes missed were not consistent, and we were only 
ever seeing errors on import that indicated an index had not yet been built. 
For example:

pg_restore: [archiver (db)] could not execute query: ERROR:  index 
"index_versions_on_item_type_and_item_id" does not exist
   Command was: DROP INDEX public.index_versions_on_item_type_and_item_id;

Which seemed like a reasonable error to us. We had no errors on insertion to 
indicate that index creation was a problem. 

We believed this might be a race condition, so we attempted to do a schema-only 
restore followed by a data-only restore just for this database. This worked a 
few times, and then began growing exponentially in completion time before it 
became unsustainable. We figured we were using too many jobs, so we decreased 
them. Nothing helped.
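
For reference, a sketch of that schema-only/data-only sequence (flags per the
pg_restore documentation; the job count is illustrative):

# phase 1: schema only, single job
pg_restore -s -Fc -O -d DBNAME RESTORE_FILE
# phase 2: data only, in parallel
pg_restore -a -j 4 -Fc -O -d DBNAME RESTORE_FILE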

We decided to move back to a multi-job regular restore, and then the restores 
began crashing thusly:
[2016-09-14 02:20:36 UTC]LOG:  server process (PID 27624) was terminated by 
signal 9: Killed
[2016-09-14 02:20:36 UTC]LOG:  terminating any other active server processes
[2016-09-14 02:20:36 UTC] postgres [local] DBNAME WARNING:  terminating 
connection because of crash of another server process
[2016-09-14 02:20:36 UTC] postgres [local] DBNAME DETAIL:  The postmaster has 
commanded this server process to roll back the current transaction and exit, 
because another server process exited abnormally and possibly corrupted shared 
memory.

The restore crashed this way for all job numbers except for one. We’re now 
stuck back where we were prior to increasing job numbers, at one job for this 
restore in order to prevent errors and crashes.  

Background:
• 3 EC2 instances with Postgres:
  • 1 used for reporting, on PostgreSQL 9.5.4. The reporting server is a
    c4.2xlarge and should have been able to handle multiple jobs
    (8 CPU / https://aws.amazon.com/ec2/instance-types/).
  • 2 production servers: one leader and one follower, both on
    PostgreSQL 9.5.3.

We have one very large database, 678GB, and several others, but the largest is 
our concern. 

I have attached our postgresql.conf file. Thank you so much for your time.

Best,



Cea Stapleton 
Operations Engineer
http://www.healthfinch.com




[Attachment: postgresql.conf]





Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

2016-09-29 Thread Sven R. Kunze

On 23.09.2016 11:00, Pavel Stehule wrote:
2016-09-23 8:35 GMT+02:00 Sven R. Kunze:


I was wondering: would it be possible for PostgreSQL to rewrite
the query to generate the UNION (or subquery plan if it's also
fast) on its own?


It depends on the real data. On your specific data the UNION variant is 
pretty fast; on a different data set, the UNION can be pretty slow. This is 
related to the difficulty of estimating OR predicates.


I figure that the UNION is fast if the sub-results are small (which they 
are in our case). On the contrary, when they are huge, the OUTER JOIN 
variant might be preferable.



Is there something I can do to help here?

Or do you think it's naturally application-dependent and thus should be 
solved with application logic just as we did?



Cheers,
Sven


Re: [PERFORM] Millions of tables

2016-09-29 Thread Alex Ignatov (postgrespro)

From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Greg Spiegelberg
Sent: Tuesday, September 27, 2016 7:28 PM
To: Terry Schmitt 
Cc: pgsql-performa. 
Subject: Re: [PERFORM] Millions of tables


On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt wrote:


On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg wrote:

Hey all,


Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has 
said not to have millions of tables.  I too have long believed it until 
recently.


AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for 
PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those 
tables.  Table creation initially took 0.018031 secs, average 0.027467, and 
after tossing out outliers (qty 5) the maximum creation time found was 0.66139 
seconds.  Total time: 30 hours, 31 minutes and 8.435049 seconds.  Tables were 
created by a single process.  Do note that table creation is done via a plpgsql 
function, as there are other housekeeping tasks necessary, though minimal.
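
A hedged sketch of what such a creation function might look like (names and
columns are illustrative; the actual housekeeping is not shown in the thread):

CREATE OR REPLACE FUNCTION create_data_table(suffix text)
RETURNS void AS $$
BEGIN
    -- format() with %I quotes identifiers safely
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I (id bigint PRIMARY KEY, payload jsonb)',
        'data_' || suffix);
    EXECUTE format(
        'CREATE INDEX IF NOT EXISTS %I ON %I (payload)',
        'data_' || suffix || '_payload_idx', 'data_' || suffix);
    -- other housekeeping (grants, registry bookkeeping) would go here
END;
$$ LANGUAGE plpgsql;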


No system tuning but here is a list of PostgreSQL knobs and switches:

shared_buffers = 2GB

work_mem = 48 MB

max_stack_depth = 4 MB

synchronous_commit = off

effective_cache_size = 200 GB

pg_xlog is on its own file system


There are some still obvious problems.  General DBA functions such as VACUUM 
and ANALYZE should not be done.  Each will run forever and cause much grief.  
Backups are problematic in the traditional pg_dump and PITR space.  Large 
JOINs by VIEW, SELECT or via table inheritance (I am abusing it in my test 
case) are no-nos.  A system or database crash could take potentially hours to 
days to recover.  There are likely other issues ahead.


You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, 
BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated 
and don't get me started on "Hadoop".  I looked at many others and ultimately 
the recommended use of each vendor was to have one table for all data.  That 
overcomes the millions of tables problem, right?


Problem with the "one big table" solution is I anticipate 1,200 trillion 
records.  Random access is expected and the customer expects <30ms reads for a 
single record fetch.


No data is loaded... yet  Table and index creation only.  I am interested in 
the opinions of all including tests I may perform.  If you had this setup, what 
would you capture / analyze?  I have a job running preparing data.  I did this 
on a much smaller scale (50k tables) and data load via function allowed close 
to 6,000 records/second.  The schema has been simplified since, and the last 
test reached just over 20,000 records/second with 300k tables.

 

I'm not looking for alternatives yet but input to my test.  Takers?


I can't promise immediate feedback but will do my best to respond with results.


TIA,

-Greg


I have not seen transaction ID wraparound mentioned in this thread yet. With 
the numbers that you are looking at, I could see this as a major issue.


T


Thank you Terry.  You get the gold star.  :)   I was waiting for that to come 
up.


Success means handling this condition.  A whole database vacuum and 
dump-restore is out of the question.  Can a properly tuned autovacuum prevent 
the situation?


-Greg


Hi!

With millions of tables you have to set autovacuum_max_workers sky-high =). 
We have seen situations where, with thousands of tables, autovacuum cannot 
vacuum all the tables that need it. It simply vacuums some of the most-modified 
tables and never reaches the others. Only a manual vacuum can help in this 
situation. With the wraparound issue it can be a nightmare.
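
A sketch of the knobs involved; the values are purely illustrative and the
right numbers depend on hardware and table count:

autovacuum_max_workers = 16             # default is 3
autovacuum_naptime = 10s                # default 1min; visit tables more often
autovacuum_freeze_max_age = 200000000   # wraparound trigger (default shown)
maintenance_work_mem = 1GB              # vacuum working memory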


--

Alex Ignatov 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [PERFORM] Millions of tables

2016-09-29 Thread Simon Riggs
On 26 September 2016 at 05:19, Greg Spiegelberg  wrote:
> I did look at PostgresXL and CitusDB.  Both are admirable; however, neither
> could support the need to read a random record consistently under 30ms.
> It's a similar problem Cassandra and others have: network latency.  At this
> scale, to provide the ability to access any given record amongst trillions
> it is imperative to know precisely where it is stored (system & database)
> and read a relatively small index.  I have other requirements that prohibit
> use of any technology that is eventually consistent.

Then XL is exactly what you need, since it does allow you to calculate
exactly where the record is via hash and then access it, which makes
the request just a single datanode task.
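
For illustration, a hash-distributed table in XL looks roughly like this
(syntax per Postgres-XL; the table is made up):

CREATE TABLE fish (
    id      bigint PRIMARY KEY,
    length  int,
    color   text
) DISTRIBUTE BY HASH (id);

-- a point lookup on the distribution key is routed to a single datanode
SELECT * FROM fish WHERE id = 12345;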

XL is not the same as CitusDB.

> I liken the problem to fishing.  To find a particular fish of length, size,
> color  in a data lake you must accept the possibility of scanning the
> entire lake.  However, if all fish were in barrels where each barrel had a
> particular kind of fish of specific length, size, color  then the problem
> is far simpler.

The task of putting the fish in the appropriate barrel is quite hard.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

