Re: [GENERAL] optimizing a query
On Jun 21, 2016, at 6:55 PM, David G. Johnston wrote:

> Aside from the name these indexes are identical...

Sorry. Tired eyes copy/pasting between windows and trying to 'average' out 40 similar queries.

> These two items combined reduce the desirability of diagnosing this... it
> doesn't seem like you've faithfully recreated the scenario for us to evaluate.
>
> Your post is also not self-contained and you haven't provided the actual
> EXPLAINs you are getting.

I played around with some more indexes, creating and disabling them on one specific query. Eventually I found some index formats that didn't pull in the whole table. They gave approximately the same results as the other selects, with some differences in reporting. The heap scan on the table was negligible; the big hit was off the outer hash join. The formatting in EXPLAIN made a negligible check look like it was the root issue.

CREATE TABLE t_a (
    id SERIAL PRIMARY KEY,
    col_1 INT NOT NULL,
    col_2 BOOLEAN DEFAULT NULL
);

CREATE INDEX test_idx__t_a_col1_col2__v1 ON t_a (col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v2 ON t_a (col_1, id) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v3 ON t_a (id, col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v4 ON t_a (id, col_1, col_2) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v5 ON t_a (col_1, col_2) WHERE col_2 IS NOT FALSE;

CREATE TABLE t_b (
    id SERIAL PRIMARY KEY,
    col_1 INT NOT NULL,
    col_2 BOOLEAN DEFAULT NULL
);

CREATE TABLE t_a2b (
    a_id INT NOT NULL REFERENCES t_a(id),
    b_id INT NOT NULL REFERENCES t_b(id),
    col_a INT NOT NULL,
    PRIMARY KEY (a_id, b_id)
);

EXPLAIN ANALYZE
SELECT t_a2b.b_id AS t_a2b_b_id,
       count(t_a2b.b_id) AS counted
FROM t_a2b
JOIN t_a ON t_a2b.a_id = t_a.id
WHERE t_a.col_1 = 730
  AND t_a2b.col_a = 1
  AND (t_a.col_2 IS NOT FALSE)
GROUP BY t_a2b.b_id
ORDER BY counted DESC, t_a2b.b_id ASC
LIMIT 25 OFFSET 0;

QUERY PLAN
Limit (cost=270851.55..270851.62 rows=25 width=4) (actual time=1259.950..1259.953 rows=25 loops=1)
  -> Sort (cost=270851.55..270863.43 rows=4750 width=4) (actual time=1259.945..1259.945 rows=25 loops=1)
       Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
       Sort Method: top-N heapsort  Memory: 26kB
       -> HashAggregate (cost=270670.01..270717.51 rows=4750 width=4) (actual time=1259.430..1259.769 rows=1231 loops=1)
            Group Key: t_a2b.b_id
            -> Hash Join (cost=171148.45..270516.71 rows=30660 width=4) (actual time=107.662..1230.481 rows=124871 loops=1)
                 Hash Cond: (t_a2b.a_id = t_a.id)
                 -> Seq Scan on t_a2b (cost=0.00..89741.18 rows=2485464 width=8) (actual time=0.011..661.978 rows=2492783 loops=1)
                      Filter: (col_a = 1)
                      Rows Removed by Filter: 2260712
                 -> Hash (cost=170446.87..170446.87 rows=56126 width=4) (actual time=107.409..107.409 rows=48909 loops=1)
                      Buckets: 65536  Batches: 1  Memory Usage: 2232kB
                      -> Bitmap Heap Scan on t_a (cost=1055.41..170446.87 rows=56126 width=4) (actual time=18.243..94.470 rows=48909 loops=1)
                           Recheck Cond: ((col_1 = 730) AND (col_2 IS NOT FALSE))
                           Heap Blocks: exact=43972
                           -> Bitmap Index Scan on test_idx__t_a_col1_col2__v2 (cost=0.00..1041.38 rows=56126 width=0) (actual time=8.661..8.661 rows=48909 loops=1)
                                Index Cond: (col_1 = 730)
Planning time: 0.796 ms
Execution time: 1260.092 ms

QUERY PLAN
--
Limit (cost=208239.59..208239.65 rows=25 width=4) (actual time=1337.739..1337.743 rows=25 loops=1)
  -> Sort (cost=208239.59..208251.47 rows=4750 width=4) (actual time=1337.737..1337.739 rows=25 loops=1)
       Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
       Sort Method: top-N heapsort  Memory: 26kB
       -> HashAggregate (cost=208058.05..208105.55 rows=4750 width=4)
Re: [GENERAL] Help needed structuring Postgresql correlation query
Thanks for that, looks like something to sink my teeth into!

On 21 June 2016 at 13:29, Alban Hertroys wrote:
>> On 19 Jun 2016, at 10:58, Tim Smith wrote:
>> [...]
>> Hope that makes sense
>
> Something like this could do the trick (untested):
>
> with recursive sample (nr, start_date) as (
>     select 1 as nr, data_date as start_date,
>            SUM(data_measurement) as total
>     from generate_series(0, 19) range(step)
>     left join data on (data_date = start_date + range.step)
>
>     union all
>
>     select nr + 1, sample.start_date + 1, SUM(data_measurement) as total
>     from sample
>     join generate_series(0, 19) range(step)
>     left join data on (data_date = start_date + 1 + range.step)
>     where start_date + 1 + 19 <= (select MAX(data_date) from data)
>     group by 1, 2
> )
> select * from sample where start_date >= '2016-01-01';
>
> Not sure how best to go about parameterising sample size N; a stored
> function seems like a good option.
> Another approach would be to move a (cumulative) window function with 20
> items over your data set and for each row subtract the first value of the
> previous window from the total of the current window (that is, assuming
> you're calculating a SUM of data_measurement for each window of 20 records).
>
> Visually that looks something like this for sample size 4:
>
> sample 1: (A + B + C + D)
> sample 2: (A + B + C + D) + E - A = (B + C + D + E)
> sample 3: (B + C + D + E) + F - B = (C + D + E + F)
> etc.
>
> To accomplish this, you calculate two cumulative totals (often misnamed as
> running totals, but AFAIK that's something different), one from the start,
> and one lagging N rows behind (you can use the lag() window function for
> that), and subtract the two.
>
> Good luck!
>
> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] optimizing a query
On Tue, Jun 21, 2016 at 6:44 PM, Adrian Klaver wrote:
> On 06/21/2016 03:33 PM, Jonathan Vanasco wrote:
>> In effort of simplifying the work, I've created indexes on t_a that have
>> all the related columns.
>>
>> CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
>> CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;

Aside from the name these indexes are identical...

>> postgres will query test_idx__a first (yay!) but then does a bitmap heap
>> scan on t_a, and uses the raw t_a for the hash join.
>>
>> I don't actually need any information from t_a - it's just there for the
>> filtering, and ideally postgres would just use the index.

This is the description of a semi-join.

WHERE EXISTS (
    SELECT 1
    FROM t_a
    WHERE t_a.id = t_a2b.a_id
      AND t_a.col_1 = 730
      AND t_a.col_2 IS NOT FALSE
)

>> I thought this might have been from using a partial index, but the same
>> results happen with a full index. I just can't seem to avoid this hash
>> join against the full table.
>>
>> anyone have a suggestion?

> The below works without including t_a in the FROM?

>> example query
>>
>> SELECT t_a2b.b_id AS b_id,
>>        count(t_a2b.b_id) AS counted
>> FROM t_a2b
>> WHERE t_a2b.col_a = 1
>>   AND t_a.col_1 = 730
>>   AND t_a.col_2 IS NOT False
>> GROUP BY t_a2b.b_id
>> ORDER BY counted DESC,
>>          t_a2b.b_id ASC

These two items combined reduce the desirability of diagnosing this... it
doesn't seem like you've faithfully recreated the scenario for us to
evaluate.

Your post is also not self-contained and you haven't provided the actual
EXPLAINs you are getting.

David J.
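For completeness, here is one way the full rewrite David describes might look: a sketch only, assuming the t_a/t_a2b definitions from the original post, and untested against the poster's data.

```sql
-- Hypothetical semi-join rewrite: t_a is referenced only inside EXISTS,
-- so it acts purely as a filter and never contributes output columns.
SELECT t_a2b.b_id AS b_id,
       count(t_a2b.b_id) AS counted
FROM t_a2b
WHERE t_a2b.col_a = 1
  AND EXISTS (
        SELECT 1
        FROM t_a
        WHERE t_a.id = t_a2b.a_id
          AND t_a.col_1 = 730
          AND t_a.col_2 IS NOT FALSE
      )
GROUP BY t_a2b.b_id
ORDER BY counted DESC, t_a2b.b_id ASC;
```

Whether the planner then satisfies the EXISTS from the partial index alone depends on the statistics and the Postgres version; EXPLAIN ANALYZE on real data is the only way to confirm.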
Re: [GENERAL] optimizing a query
On 06/21/2016 03:33 PM, Jonathan Vanasco wrote:
> I have a handful of queries in the following general form that I can't
> seem to optimize any further (same results on 9.3, 9.4, 9.5). I'm
> wondering if anyone might have a suggestion, or if they're done.
>
> The relevant table structure:
>
>     t_a2b
>         a_id INT references t_a(id)
>         b_id INT references t_b(id)
>         col_a
>
>     t_a
>         id INT
>         col_1 INT
>         col_2 BOOL
>
> The selects query the association table (t_a2b) and join in a related
> table (t_a) for some filtering. In effort of simplifying the work, I've
> created indexes on t_a that have all the related columns.
>
>     CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
>     CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
>
> postgres will query test_idx__a first (yay!) but then does a bitmap heap
> scan on t_a, and uses the raw t_a for the hash join.
>
> I don't actually need any information from t_a - it's just there for the
> filtering, and ideally postgres would just use the index.
>
> I thought this might have been from using a partial index, but the same
> results happen with a full index. I just can't seem to avoid this hash
> join against the full table.
>
> anyone have a suggestion?

The below works without including t_a in the FROM?

> example query
>
>     SELECT t_a2b.b_id AS b_id,
>            count(t_a2b.b_id) AS counted
>     FROM t_a2b
>     WHERE t_a2b.col_a = 1
>       AND t_a.col_1 = 730
>       AND t_a.col_2 IS NOT False
>     GROUP BY t_a2b.b_id
>     ORDER BY counted DESC,
>              t_a2b.b_id ASC

--
Adrian Klaver
adrian.kla...@aklaver.com
[GENERAL] optimizing a query
I have a handful of queries in the following general form that I can't seem
to optimize any further (same results on 9.3, 9.4, 9.5). I'm wondering if
anyone might have a suggestion, or if they're done.

The relevant table structure:

    t_a2b
        a_id INT references t_a(id)
        b_id INT references t_b(id)
        col_a

    t_a
        id INT
        col_1 INT
        col_2 BOOL

The selects query the association table (t_a2b) and join in a related table
(t_a) for some filtering. In an effort to simplify the work, I've created
indexes on t_a that have all the related columns.

    CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
    CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;

Postgres will query test_idx__a first (yay!) but then does a bitmap heap
scan on t_a, and uses the raw t_a for the hash join.

I don't actually need any information from t_a - it's just there for the
filtering, and ideally postgres would just use the index.

I thought this might have been from using a partial index, but the same
results happen with a full index. I just can't seem to avoid this hash join
against the full table.

Anyone have a suggestion?

Example query:

    SELECT t_a2b.b_id AS b_id,
           count(t_a2b.b_id) AS counted
    FROM t_a2b
    WHERE t_a2b.col_a = 1
      AND t_a.col_1 = 730
      AND t_a.col_2 IS NOT False
    GROUP BY t_a2b.b_id
    ORDER BY counted DESC,
             t_a2b.b_id ASC
Re: [GENERAL] does timestamp precision affect storage size?
On Jun 21, 2016, at 4:50 PM, Tom Lane wrote:
> Storage-wise, no. If you have a resolution spec on your columns now,
> I think dropping the resolution spec would save you a few nanoseconds per
> row insertion due to not having to apply the roundoff function. Adding
> one would certainly not improve speed.

On Jun 21, 2016, at 4:47 PM, Vik Fearing wrote:
> No, there are no space savings here.
>
> =# select pg_column_size('now'::timestamptz(0)),
>           pg_column_size('now'::timestamptz);

Thanks. I thought that was happening, but wanted to make sure. The allure of
shaving a byte or two off some rows couldn't be ignored ;)
Re: [GENERAL] does timestamp precision affect storage size?
Jonathan Vanasco writes:
> would there be any savings in storage or performance improvements from
> losing the resolution on fractional seconds,

Storage-wise, no. If you have a resolution spec on your columns now,
I think dropping the resolution spec would save you a few nanoseconds per
row insertion due to not having to apply the roundoff function. Adding one
would certainly not improve speed.

regards, tom lane
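As a quick illustration of the roundoff function Tom mentions (a sketch, easy to verify in psql): a precision spec rounds the fractional seconds to the nearest representable value rather than truncating them.

```sql
-- With a precision spec, the roundoff function is applied at cast/store time.
SELECT '2016-06-21 12:34:56.789'::timestamp(0);  -- rounds up to 12:34:57
SELECT '2016-06-21 12:34:56.789'::timestamp;     -- keeps 12:34:56.789
```

Either way the on-disk value occupies the same 8 bytes, as Vik's pg_column_size check shows.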
Re: [GENERAL] does timestamp precision affect storage size?
On 21/06/16 22:39, Jonathan Vanasco wrote:
> i'm cleaning up some queries for performance, and noticed that we never use
> precision beyond the second (ie, `timestamp(0)`) in our business logic.
>
> would there be any savings in storage or performance improvements from
> losing the resolution on fractional seconds, or are `timestamp(precision)`
> effectively the same for storage as `timestamp`? (based on docs, I assume
> the latter but wanted to check)

No, there are no space savings here.

=# select pg_column_size('now'::timestamptz(0)),
          pg_column_size('now'::timestamptz);
 pg_column_size | pg_column_size
----------------+----------------
              8 |              8
(1 row)

--
Vik Fearing                                          +33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
[GENERAL] does timestamp precision affect storage size?
I'm cleaning up some queries for performance, and noticed that we never use
precision beyond the second (ie, `timestamp(0)`) in our business logic.

Would there be any savings in storage or performance improvements from
losing the resolution on fractional seconds, or are `timestamp(precision)`
effectively the same for storage as `timestamp`? (Based on the docs, I
assume the latter but wanted to check.)
Re: [GENERAL] Help with namespaces in xpath (PostgreSQL 9.5.3)
Thank you David.

-Allan.

On Mon, Jun 20, 2016 at 11:19 PM, David G. Johnston <david.g.johns...@gmail.com> wrote:
> On Sun, Jun 19, 2016 at 5:09 PM, Allan Kamau wrote:
>
>> I have an xml document from which I would like to extract the contents of
>> several elements.
>>
>> I would like to use xpath to extract the contents of "name" from the xml
>> document shown below.
>>
>> WITH x AS
>> (
>>   SELECT
>>     '<uniprot xmlns="http://uniprot.org/uniprot"
>>               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>               xsi:schemaLocation="http://uniprot.org/uniprot
>>                                   http://www.uniprot.org/support/docs/uniprot.xsd">
>>        <entry version="56">
>>          <accession>A0JM59</accession>
>>          <name>UBP20_XENTR</name>
>>        </entry>
>>      </uniprot>'::xml AS d
>> )
>> SELECT (xpath('/uniprot/entry/name/text()', a.d))[1]::text AS uniprot_name
>> FROM
>>   x AS a
>> ;
>>
>> The documentation for xpath()
>> (https://www.postgresql.org/docs/9.5/static/functions-xml.html)
>> describes "xpath(xpath, xml [, nsarray])".
>>
>> For the above xml document, what would be the two-dimensional array
>> "nsarray" for the xpath() function?
>
> Is there a specific part of the description and the two examples that
> doesn't make sense to you?
>
> Or more specifically, do you understand what namespaces are?
>
> ARRAY[
>   ARRAY['defaultns','http://uniprot.org/uniprot'],
>   ARRAY['xsi','http://www.w3.org/2001/XMLSchema-instance']
> ]
>
> In effect, when the xpath function parses the XML document it tosses away
> all of the document-local namespace aliases and instead associates the
> full namespace URI with each element (in the DOM). Since, in the xpath
> expression, you'd usually want to refer to nodes in the DOM via their
> namespace alias, you need to tell the xpath function which aliases you
> intend to use in the xpath and which full URI they correspond to.
> Furthermore, there is no concept of a default namespace in the xpath
> expression. So while you can simply copy-paste the aliases and URIs of
> all the non-default namespace aliases, you must also choose a unique
> alias for the default namespace in the original document.
>
> In the above I've copied the alias and namespace URI for the named "xsi"
> alias and gave the name "defaultns" to the original document's default
> namespace URI.
>
> David J.
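Putting David's answer together with the original query, the complete call might look like the sketch below. "defaultns" is an arbitrary alias chosen for the document's default namespace (any unused name works), and the CTE `x` is the one from Allan's question.

```sql
-- Hypothetical complete form: every element in the path must be qualified
-- with the alias registered in the nsarray, since xpath() has no notion of
-- a default namespace.
SELECT (xpath(
    '/defaultns:uniprot/defaultns:entry/defaultns:name/text()',
    a.d,
    ARRAY[
        ARRAY['defaultns', 'http://uniprot.org/uniprot'],
        ARRAY['xsi', 'http://www.w3.org/2001/XMLSchema-instance']
    ]
))[1]::text AS uniprot_name
FROM x AS a;
```

Without the nsarray (and the `defaultns:` prefixes), the path matches nothing, because the elements carry the full namespace URI in the DOM.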
Re: [HACKERS] [GENERAL] PgQ and pg_dump
2016-06-21 13:08 GMT-03:00 Robert Haas:
> On Thu, Jun 16, 2016 at 1:46 PM, Martín Marqués wrote:
>> The comment is accurate on what is going to be dumpable and what's not
>> from the code. In our case, as the pgq schema is not dumpable because
>> it comes from an extension, other objects it contains will not be
>> dumpable as well.
>>
>> That's the reason why the PgQ event tables created by
>> pgq.create_queue() are not dumped.
>
> That sucks.

Yes, and I'm surprised we haven't had any bug reports yet on inconsistent
dumps. The patch that changed pg_dump's behavior on extension objects is
more than a year old.

I'll find some time today to add tests and check for other objects that are
not dumped for the same reason.

Cheers,

--
Martín Marqués            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
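For context, the mechanism extensions normally use to get their data tables dumped is pg_extension_config_dump(). A sketch of how an extension script flags such a table (the table name here is hypothetical; note this only helps for tables that exist when the extension script runs, which is exactly what runtime-created queue tables sidestep):

```sql
-- Inside an extension's SQL script: mark a member table as a configuration
-- table so pg_dump includes its contents; the second argument is an optional
-- WHERE condition selecting which rows to dump ('' means all rows).
SELECT pg_extension_config_dump('my_extension_config_table', '');
```

Tables created later by a function such as pgq.create_queue() never pass through this call, which is why their data silently goes missing from dumps.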
Re: [HACKERS] [GENERAL] PgQ and pg_dump
On Thu, Jun 16, 2016 at 1:46 PM, Martín Marqués wrote:
> The comment is accurate on what is going to be dumpable and what's not
> from the code. In our case, as the pgq schema is not dumpable because
> it comes from an extension, other objects it contains will not be
> dumpable as well.
>
> That's the reason why the PgQ event tables created by
> pgq.create_queue() are not dumped.

That sucks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
R: R: [GENERAL] Vacuum full: alternatives?
Hello,

very interesting comments and contributions, thank you.

> I've just tested pg_bulkload with the default settings, and it
> definitely isn't using the fsm to re-use freed space in the table. If
> they use WRITER = BUFFERED it would, though.

So with WRITER = BUFFERED it should be slower, but free-marked space should
be reused again?

Thank you!
Francesco

From: Jeff Janes [jeff.ja...@gmail.com]
Sent: Monday, 20 June 2016 17:51
To: Martín Marqués
Cc: Melvin Davidson; Rakesh Kumar; Job; pgsql-general@postgresql.org
Subject: Re: R: [GENERAL] Vacuum full: alternatives?

On Mon, Jun 20, 2016 at 7:23 AM, Martín Marqués wrote:
> On 20/06/16 at 09:50, Melvin Davidson wrote:
>>
>>> but it won't let it grow too (or am I missing something).
>>
>> Yes, you are missing something. By partitioning and {Vacuum Full only the
>> table with data no longer needed}, the rest of the data remains available
>> to the users AND space is reclaimed by the O/S, so it's the best of both
>> worlds.
>
> That's not entirely true. Think about a SELECT which has to scan all
> child tables.

Yes, for the partitioning to be a good option, you would probably have to
arrange it such that you can prove that all tuples in a given partition are
eligible for deletion (or have already been deleted), and then either
truncate or dis-inherit the partition. That still requires a stringent
lock, but it is only held for a very short time.

> You are also adding another layer of complexity to the system.

I think that using pg_bulkload adds more complexity to the system than
partitioning would. I wonder if they really need to use that, or if they
just picked it over COPY because it sounded like a free lunch.

Cheers,

Jeff
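The partition-retirement step Jeff describes can be sketched as follows. Table names are hypothetical, and this assumes 9.x-style inheritance partitioning where each child holds one retention period.

```sql
-- Once every row in the child is eligible for deletion, detach it from the
-- parent and drop (or truncate) it. The ALTER/DROP need an exclusive lock
-- on the tables involved, but hold it only momentarily, unlike VACUUM FULL.
BEGIN;
ALTER TABLE measurements_2015 NO INHERIT measurements;
COMMIT;

DROP TABLE measurements_2015;
-- or: TRUNCATE measurements_2015;  -- keep the empty shell for reuse
```

Space goes back to the OS immediately, and queries against the parent never see a long-held lock.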
Re: [GENERAL] Help needed structuring Postgresql correlation query
> On 19 Jun 2016, at 10:58, Tim Smith wrote:
>
> Hi,
>
> My postgresql-fu is not good enough to write a query to achieve this
> (some may well say R is a better suited tool to achieve this!).
>
> I need to calculate what I would call a correlation window on a time
> series of data, my table looks like this:
>
> create table data(data_date date, data_measurement numeric);
> insert into data values('2016-01-01',16.23);
> insert into data values('2016-06-19',30.54);
>
> My "target sample" would be the N most recent samples in the table
> (e.g. 20, the most recent 20 days).
>
> My "potential sample" would be a moving window of size N (the same
> size N as above), starting at T0 (i.e. 2016-01-01 in this example) and
> incrementing by one (i.e. 2016-01-01 + 20, then 2016-01-02 + 20, etc.),
> but the "target sample" would obviously be excluded.
>
> The output needs to display the window date range (or at least the start
> date of the "potential sample" window) and the result of
> corr(target, potential).
>
> Hope that makes sense

Something like this could do the trick (untested):

with recursive sample (nr, start_date) as (
    select 1 as nr, data_date as start_date,
           SUM(data_measurement) as total
    from generate_series(0, 19) range(step)
    left join data on (data_date = start_date + range.step)

    union all

    select nr + 1, sample.start_date + 1, SUM(data_measurement) as total
    from sample
    join generate_series(0, 19) range(step)
    left join data on (data_date = start_date + 1 + range.step)
    where start_date + 1 + 19 <= (select MAX(data_date) from data)
    group by 1, 2
)
select * from sample where start_date >= '2016-01-01';

Not sure how best to go about parameterising sample size N; a stored
function seems like a good option.

Another approach would be to move a (cumulative) window function with 20
items over your data set and for each row subtract the first value of the
previous window from the total of the current window (that is, assuming
you're calculating a SUM of data_measurement for each window of 20 records).

Visually that looks something like this for sample size 4:

sample 1: (A + B + C + D)
sample 2: (A + B + C + D) + E - A = (B + C + D + E)
sample 3: (B + C + D + E) + F - B = (C + D + E + F)
etc.

To accomplish this, you calculate two cumulative totals (often misnamed as
running totals, but AFAIK that's something different), one from the start,
and one lagging N rows behind (you can use the lag() window function for
that), and subtract the two.

Good luck!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
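The two-cumulative-totals idea can be sketched in SQL like this (also untested; it assumes the `data` table from the question, a dense daily series, and a fixed N of 20):

```sql
-- Sliding 20-row SUM: cumulative total minus the cumulative total from
-- 20 rows earlier; running(i) - running(i-20) = sum of rows i-19 .. i.
SELECT data_date AS window_end,
       running
       - COALESCE(lag(running, 20) OVER (ORDER BY data_date), 0)
         AS window_total
FROM (
    SELECT data_date,
           SUM(data_measurement) OVER (ORDER BY data_date) AS running
    FROM data
) t;
```

The COALESCE handles the first 20 rows, where lag() has nothing to look back at; those rows yield partial-window totals rather than full 20-sample windows.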
Re: [GENERAL] Regression tests (Background Workers)
On Tue, Jun 21, 2016 at 2:02 PM, Dharini wrote:
> Background process is initialized at server start and when trying to run
> the tests I get the following error.
>
> $ make installcheck
> (using postmaster on Unix socket, default port)
> == dropping database "contrib_regression" ==
> ERROR:  DROP DATABASE cannot be executed from a function or multi-command
> string
> command failed: "/postgres/install/bin/psql" -X -c "DROP DATABASE IF EXISTS
> \"contrib_regression\"" "postgres"
> make: *** [installcheck] Error 2

Which test are you trying to run? That's not from the in-core source tree,
right? And on which version of Postgres is this attempt tried? With this
level of detail it's hard to know what's going on; one can just guess that
PreventTransactionChain is being called on DROP DATABASE.

--
Michael