Re: [PERFORM] query a table with lots of columns
At first, thanks for your fast and comprehensive help.

The structure of my cache table is:

a text, b text NOT NULL, c text, d text, e timestamp without time zone DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0, s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0

Additional constraints: primary key (b), UNIQUE(b), UNIQUE(a). Indexes: index on a, index on b. This table has 30 million rows (it will increase to 50 million in future).

My work table is: b text, g integer, with indexes on b and g. This table has 5 million rows.

Scenario — what I want to achieve:

SELECT s1, s2, s3, ..., s512, g, d FROM worktable INNER JOIN cachetable USING (b) ORDER BY g;

The inner join will match at least 95 % of the rows of the smaller work table, in this example 4.75 million rows. Running this query takes several hours until I receive the first results. Query analysis shows that the execution plan does two sequential scans, one on the cache table and one on the work table.

When I divide this huge statement into

SELECT s1, s2, s3, ..., s512, g, d FROM worktable INNER JOIN cachetable USING (b) WHERE g BETWEEN 1 AND 10000 ORDER BY g;
SELECT s1, s2, s3, ..., s512, g, d FROM worktable INNER JOIN cachetable USING (b) WHERE g BETWEEN 10001 AND 20000 ORDER BY g;
...

(I can do this because g is a unique and continuous id from 1 to N), each result is fast, but firing the requests in parallel (4-8 at a time) slows down the retrieval. The execution plan changes to use the indexes when the BETWEEN clause is added.

One remark which might help: overall, 90-95 % of the s1-s512 values are 0, and I am only interested in values not equal to 0. Perhaps it would make sense to use an array or JSON and enumerate only the values not equal to 0.

Statistics on the large table: table size 80 GB, TOAST table size 37 GB, size of indexes 17 GB.

Thanks for your help and ideas
Björn

On 19.09.2014 23:40, Josh Berkus wrote:
> On 09/19/2014 04:51 AM, Björn Wittich wrote:
>> I am relatively new to postgres. I have a table with 500 columns and about 40 million rows. I call this a cache table, where one column is a unique key (indexed) and the 499 columns (type integer) are some values belonging to this key.
>> Now I have a second (temporary) table (only 2 columns, one being the key of my cache table) and I want to do an inner join between my temporary table and the large cache table and export all matching rows. I found out that the performance increases when I limit the join to lots of small parts. But it seems that the database needs a lot of disk IO to gather all 499 data columns. Is there a possibility to tell the database that all these columns are always treated as tuples and I always want to get the whole row? Perhaps the disk organization could then be optimized?
>
> PostgreSQL is already a row store, which means by default you're getting all of the columns, and the columns are stored physically adjacent to each other. If requesting only one or two columns is faster than requesting all of them, that's pretty much certainly due to transmission time, not disk IO. Otherwise, please post your schema (well, a truncated version) and your queries.
>
> BTW, in cases like yours I've used an INT array instead of 500 columns to good effect; it works slightly better with PostgreSQL's compression.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
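One way to get first rows quickly without hand-splitting the query into BETWEEN ranges is a server-side cursor. This is only a sketch, using the table and column names from the message above; when a query runs through a cursor, the planner favors fast-start plans (see the `cursor_tuple_fraction` setting), so it may walk the index on g instead of hashing/sorting the whole join result first:

```sql
-- Sketch: stream the join through a cursor instead of chunking it client-side.
-- Table/column names assumed from the thread (worktable, cachetable, b, g, d).
BEGIN;

DECLARE batch CURSOR FOR
    SELECT s1, s2, s3, /* ... */ s512, g, d
    FROM worktable
    INNER JOIN cachetable USING (b)
    ORDER BY g;

FETCH 10000 FROM batch;   -- first rows can arrive before the full join finishes
-- repeat FETCH 10000 FROM batch; until no rows are returned

CLOSE batch;
COMMIT;
```

Whether the planner actually picks a fast-start plan depends on the indexes and statistics, so this is worth checking with EXPLAIN before relying on it.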
Re: [PERFORM] query a table with lots of columns
> The structure of my cache table is:
> a text, b text NOT NULL, c text, d text, e timestamp without time zone DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0, s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0
> Additional constraints: primary key (b), UNIQUE(b), UNIQUE(a). Indexes: index on a, index on b.

This looks redundant, e.g. you don't need a unique index on b if you already have a primary key on it. Can you post the complete table definition?

> One remark which might help: overall, 90-95 % of the s1-s512 values are 0, and I am only interested in values not equal to 0. Perhaps it would make sense to use an array or JSON and enumerate only the values not equal to 0.

Could you change that to replace 0 values with NULLs? This would greatly reduce your table space, as Postgres is very efficient about NULL storage: it marks all null values in a bitmap within the row header, so you need only about one bit per NULL instead of 4 bytes per zero, and hence get rid of your I/O issue.

regards,
Marc Mamin

From: pgsql-performance-ow...@postgresql.org on behalf of Björn Wittich [bjoern_witt...@gmx.de]
Sent: Saturday, 20 September 2014 09:19
To: Josh Berkus; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] query a table with lots of columns

[...]
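Marc's zeros-to-NULLs suggestion could be applied in place along these lines. This is a sketch using the assumed table and column names from the thread; note that the UPDATE rewrites every row, so disk usage grows until the old row versions are reclaimed:

```sql
-- Sketch of Marc's suggestion: store NULL instead of 0.
-- NULLIF(x, 0) returns NULL when x = 0, otherwise x.

ALTER TABLE cachetable
    ALTER COLUMN s1 DROP DEFAULT,
    ALTER COLUMN s2 DROP DEFAULT;
    -- ... likewise for s3 through s512

UPDATE cachetable
SET s1 = NULLIF(s1, 0),
    s2 = NULLIF(s2, 0);
    -- ... likewise for s3 through s512

-- The UPDATE leaves dead tuples behind; reclaim the space afterwards
-- (VACUUM FULL takes an exclusive lock on the table):
VACUUM FULL cachetable;
```

Queries then select only the non-NULL values, matching the "only interested in values not equal to 0" requirement.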
Re: [PERFORM] query a table with lots of columns
Hi,

ok, here are my schemata. cachetable: 30-50 million rows; worktable: 5-25 million rows.

CREATE TABLE cachetable
(
    a text,
    b text NOT NULL,
    c text,
    d text,
    e timestamp without time zone DEFAULT now(),
    f text,
    s1 integer DEFAULT 0,
    s2 integer DEFAULT 0,
    s3 integer DEFAULT 0,
    -- ... s4 through s187 follow, all identical: integer DEFAULT 0 ...
    s188 integer
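A wide table like this (the listing above is cut off mid-definition) does not have to be written out by hand; the repeated column definitions can be generated. A sketch, assuming only the names already posted in the thread:

```sql
-- Sketch: build the CREATE TABLE statement for the 512 identical
-- integer columns dynamically instead of hand-writing them.
DO $$
BEGIN
    EXECUTE 'CREATE TABLE cachetable ('
         || 'a text, b text NOT NULL, c text, d text, '
         || 'e timestamp without time zone DEFAULT now(), f text, '
         || (SELECT string_agg(format('s%s integer DEFAULT 0', i), ', ')
             FROM generate_series(1, 512) AS i)
         || ')';
END
$$;
```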
Re: [PERFORM] query a table with lots of columns
On 19 September 2014 13:51, Björn Wittich <bjoern_witt...@gmx.de> wrote:
> Hi mailing list,
> I am relatively new to postgres. I have a table with 500 columns and about 40 million rows. [...]

Hi,
do you have indexes on the columns you use for joins?

Szymon
Re: [PERFORM] query a table with lots of columns
Hi Szymon,

yes, I have indexes on both columns (one in each table) which I am using for the join operation.

On 19.09.2014 14:04, Szymon Guz wrote:
> On 19 September 2014 13:51, Björn Wittich <bjoern_witt...@gmx.de> wrote: [...]
>
> Hi,
> do you have indexes on the columns you use for joins?
>
> Szymon
Re: [PERFORM] query a table with lots of columns
2014-09-19 13:51 GMT+02:00 Björn Wittich <bjoern_witt...@gmx.de>:
> Hi mailing list,
> I am relatively new to postgres. I have a table with 500 columns and about 40 million rows. [...]

Sorry for the off-topic, but array databases may be better suited for your purpose:

http://rasdaman.com/
http://www.scidb.org/

> Thank you for feedback and ideas
> Best Neo
Re: [PERFORM] query a table with lots of columns
On 09/19/2014 04:51 AM, Björn Wittich wrote:
> I am relatively new to postgres. I have a table with 500 columns and about 40 million rows. I call this a cache table, where one column is a unique key (indexed) and the 499 columns (type integer) are some values belonging to this key. [...]
> Is there a possibility to tell the database that all these columns are always treated as tuples and I always want to get the whole row? Perhaps the disk organization could then be optimized?

PostgreSQL is already a row store, which means by default you're getting all of the columns, and the columns are stored physically adjacent to each other. If requesting only one or two columns is faster than requesting all of them, that's pretty much certainly due to transmission time, not disk IO. Otherwise, please post your schema (well, a truncated version) and your queries.

BTW, in cases like yours I've used an INT array instead of 500 columns to good effect; it works slightly better with PostgreSQL's compression.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
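Josh's INT-array idea would look roughly as follows. This is a sketch with assumed names (cachetable_arr is hypothetical); the 512 columns collapse into one integer[] value, which PostgreSQL can TOAST-compress as a whole:

```sql
-- Sketch of the INT-array layout suggested above (names assumed).
CREATE TABLE cachetable_arr (
    b text PRIMARY KEY,
    s integer[] NOT NULL      -- s[1] .. s[512] replace columns s1 .. s512
);

-- Populating from the wide table (first elements shown, rest elided):
INSERT INTO cachetable_arr (b, s)
SELECT b, ARRAY[s1, s2, s3 /* , ... , s512 */]
FROM cachetable;

-- Individual values stay addressable by 1-based index:
SELECT b, s[1], s[2] FROM cachetable_arr;
```

Since 90-95 % of the values are 0, the repetitive array contents should compress well, which is the effect Josh describes.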