Re: [PERFORM] query a table with lots of columns

2014-09-20 Thread Björn Wittich

First of all, thanks for your fast and comprehensive help.

The structure of my cache table is

a text, b text NOT NULL, c text, d text, e timestamp without time zone
DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0,
s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0


Additional constraints: primary key (b), UNIQUE(b), UNIQUE(a)
Indexes: index on a, index on b

This table has 30 million rows (it will grow to 50 million in the future).

My working table is

b text, g integer

Indexes on b and g


This table has 5 million rows.

Scenario:

What I want to achieve:

SELECT s1, s2, s3, ..., s512, g, d FROM worktable INNER JOIN cachetable
USING (b) ORDER BY g



The inner join will match at least 95 % of the rows of the smaller
worktable, in this example 4.75 million rows.


Running this query, it takes several hours until I receive the first
results. EXPLAIN shows that the execution plan does two sequential
scans, on the cache table and the work table.



When I divide this huge statement into chunks such as

SELECT s1, s2, s3, ..., s512, g, d FROM worktable INNER JOIN cachetable
USING (b) WHERE g BETWEEN 1 AND 10000 ORDER BY g;
SELECT s1, s2, s3, ..., s512, g, d FROM worktable INNER JOIN cachetable
USING (b) WHERE g BETWEEN 10001 AND 20000 ORDER BY g;
and so on,


(I can do this because g is a unique and continuous id running from 1 to N.)

The result comes back fast, but firing the requests in parallel (4-8 at
a time) slows down retrieval.


The execution plan changes to use the indexes when the BETWEEN 1 AND
10000 clause is added.
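
A minimal way to confirm the plan switch (a sketch, using the table and
column names above):

-- full join: shows two sequential scans plus a sort
EXPLAIN SELECT s1, s2, g, d FROM worktable
INNER JOIN cachetable USING (b) ORDER BY g;

-- one chunk: the indexes on g and b should appear in the plan
EXPLAIN SELECT s1, s2, g, d FROM worktable
INNER JOIN cachetable USING (b)
WHERE g BETWEEN 1 AND 10000 ORDER BY g;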



One remark which might help: overall, 90-95 % of the s1-s512 values are
0, and I am only interested in the values that are not 0. Perhaps it
would make sense to use an array or JSON and enumerate only the
non-zero values.
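
A minimal sketch of that idea, assuming PostgreSQL 9.4's jsonb and a
hypothetical column named sparse (not part of the schema above; only
s1-s3 shown):

-- keep only the non-zero s-values per row, e.g. {"s2": 7, "s17": 3}
ALTER TABLE cachetable ADD COLUMN sparse jsonb;

UPDATE cachetable c
SET sparse = (
  SELECT jsonb_object_agg(t.k, t.v)
  FROM (VALUES ('s1', c.s1), ('s2', c.s2), ('s3', c.s3)
        -- ..., ('s512', c.s512)
       ) AS t(k, v)
  WHERE t.v <> 0
);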


Statistics on the large table:
table size: 80 GB
TOAST table size: 37 GB
size of indexes: 17 GB
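
(For reference, these numbers can be reproduced with something like the
following; the TOAST figure is derived as total minus heap minus indexes:)

SELECT pg_size_pretty(pg_relation_size('cachetable'))   AS table_size,
       pg_size_pretty(pg_indexes_size('cachetable'))    AS index_size,
       pg_size_pretty(pg_total_relation_size('cachetable')
                      - pg_relation_size('cachetable')
                      - pg_indexes_size('cachetable'))  AS toast_size;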


Thanks for your help and ideas

Björn





On 19.09.2014 23:40, Josh Berkus wrote:

On 09/19/2014 04:51 AM, Björn Wittich wrote:

I am relatively new to Postgres. I have a table with 500 columns and
about 40 million rows. I call this a cache table, where one column is a
unique key (indexed) and the 499 columns (type integer) are values
belonging to this key.

Now I have a second (temporary) table (only 2 columns, one being the key
of my cache table) and I want to do an inner join between my temporary
table and the large cache table and export all matching rows. I found
out that the performance increases when I limit the join to lots of
small parts.
But it seems that the database needs a lot of disk I/O to gather all 499
data columns.
Is there a possibility to tell the database that all these columns are
always treated as tuples and that I always want to get the whole row?
Perhaps the disk organization could then be optimized?

PostgreSQL is already a row store, which means by default you're getting
all of the columns, and the columns are stored physically adjacent to
each other.

If requesting only one or two columns is faster than requesting all of
them, that's pretty much certainly due to transmission time, not disk
I/O.  Otherwise, please post your schema (well, a truncated version) and
your queries.

BTW, in cases like yours I've used an INT array instead of 500 columns
to good effect; it works slightly better with PostgreSQL's compression.







Re: [PERFORM] query a table with lots of columns

2014-09-20 Thread Marc Mamin
First of all, thanks for your fast and comprehensive help.

The structure of my cache table is

a text, b text NOT NULL, c text, d text, e timestamp without time zone
DEFAULT now(), f text, s1 integer DEFAULT 0, s2 integer DEFAULT 0,
s3 integer DEFAULT 0, ..., s512 integer DEFAULT 0


Additional constraints: primary key (b), UNIQUE(b), UNIQUE(a)
Indexes: index on a, index on b

This looks redundant; e.g. you don't need a unique index on b if you
already have a primary key on it.
Can you post the complete table definition?
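
(A sketch of that cleanup; the constraint and index names below are
hypothetical, check \d cachetable for the real ones:)

-- the primary key on b already provides a unique index on b
ALTER TABLE cachetable DROP CONSTRAINT cachetable_b_key; -- hypothetical UNIQUE(b) name
DROP INDEX cachetable_b_idx;                             -- hypothetical extra index on b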

...

One remark which might help: overall, 90-95 % of the s1-s512 values are
0, and I am only interested in the values that are not 0. Perhaps it
would make sense to use an array or JSON and enumerate only the
non-zero values.

Could you change that to replace the 0 values with NULLs?
This would greatly reduce your table space, as Postgres stores NULLs
very efficiently: all null values in a row are marked in a bitmap within
the row header, so a null needs only about one bit instead of 4 bytes
for a zero. That should get rid of your I/O issue.
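
A minimal sketch of that conversion (the column defaults would need
dropping as well; only s1 and s2 shown):

-- rewrite zeros as NULLs so they occupy only a bit in the null bitmap
UPDATE cachetable
   SET s1 = NULLIF(s1, 0),
       s2 = NULLIF(s2, 0);  -- ..., s512 = NULLIF(s512, 0)

ALTER TABLE cachetable ALTER COLUMN s1 DROP DEFAULT;  -- likewise for s2..s512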

regards,

Marc Mamin

Re: [PERFORM] query a table with lots of columns

2014-09-20 Thread Björn Wittich

Hi,

OK, here are my schemas: cachetable has 30-50 million rows, worktable
5-25 million.



CREATE TABLE cachetable
(
  a text,
  b text NOT NULL,
  c text,
  d text,
  e timestamp without time zone DEFAULT now(),
  f text,
  s1 integer DEFAULT 0,
  s2 integer DEFAULT 0,
  s3 integer DEFAULT 0,
  s4 integer DEFAULT 0,
  ...,
  s512 integer DEFAULT 0

[columns s4 through s512 are declared identically; the archived message
is truncated here, at s188]

Re: [PERFORM] query a table with lots of columns

2014-09-19 Thread Szymon Guz
On 19 September 2014 13:51, Björn Wittich bjoern_witt...@gmx.de wrote:

 Hi mailing list,

 I am relatively new to Postgres. I have a table with 500 columns and
 about 40 million rows. I call this a cache table, where one column is a
 unique key (indexed) and the 499 columns (type integer) are values
 belonging to this key.

 Now I have a second (temporary) table (only 2 columns, one being the key
 of my cache table) and I want to do an inner join between my temporary
 table and the large cache table and export all matching rows. I found
 out that the performance increases when I limit the join to lots of
 small parts.
 But it seems that the database needs a lot of disk I/O to gather all 499
 data columns.
 Is there a possibility to tell the database that all these columns are
 always treated as tuples and that I always want to get the whole row?
 Perhaps the disk organization could then be optimized?


Hi,
do you have indexes on the columns you use for joins?

Szymon


Re: [PERFORM] query a table with lots of columns

2014-09-19 Thread Björn Wittich

Hi Szymon,

Yes, I have indexes on both columns (one in each table) which I am using
for the join operation.
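
(For concreteness, the indexes in question would look something like
this; the names are illustrative, not taken from the thread:)

CREATE INDEX worktable_b_idx ON worktable (b);
-- on cachetable, the primary key on b already provides the index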


On 19.09.2014 14:04, Szymon Guz wrote:

 Hi,
 do you have indexes on the columns you use for joins?

 Szymon




Re: [PERFORM] query a table with lots of columns

2014-09-19 Thread Pavel Stehule
2014-09-19 13:51 GMT+02:00 Björn Wittich bjoern_witt...@gmx.de:

 Hi mailing list,

 I am relatively new to Postgres. I have a table with 500 columns and
 about 40 million rows. I call this a cache table, where one column is a
 unique key (indexed) and the 499 columns (type integer) are values
 belonging to this key.

 Now I have a second (temporary) table (only 2 columns, one being the key
 of my cache table) and I want to do an inner join between my temporary
 table and the large cache table and export all matching rows. I found
 out that the performance increases when I limit the join to lots of
 small parts.
 But it seems that the database needs a lot of disk I/O to gather all 499
 data columns.
 Is there a possibility to tell the database that all these columns are
 always treated as tuples and that I always want to get the whole row?
 Perhaps the disk organization could then be optimized?


Sorry for going off-topic, but array databases may be better suited to
your purpose:

http://rasdaman.com/
http://www.scidb.org/




 Thank you for your feedback and ideas
 Best
 Neo





Re: [PERFORM] query a table with lots of columns

2014-09-19 Thread Josh Berkus
On 09/19/2014 04:51 AM, Björn Wittich wrote:
 
 I am relatively new to Postgres. I have a table with 500 columns and
 about 40 million rows. I call this a cache table, where one column is a
 unique key (indexed) and the 499 columns (type integer) are values
 belonging to this key.

 Now I have a second (temporary) table (only 2 columns, one being the key
 of my cache table) and I want to do an inner join between my temporary
 table and the large cache table and export all matching rows. I found
 out that the performance increases when I limit the join to lots of
 small parts.
 But it seems that the database needs a lot of disk I/O to gather all 499
 data columns.
 Is there a possibility to tell the database that all these columns are
 always treated as tuples and that I always want to get the whole row?
 Perhaps the disk organization could then be optimized?

PostgreSQL is already a row store, which means by default you're getting
all of the columns, and the columns are stored physically adjacent to
each other.

If requesting only one or two columns is faster than requesting all of
them, that's pretty much certainly due to transmission time, not disk
I/O.  Otherwise, please post your schema (well, a truncated version) and
your queries.

BTW, in cases like yours I've used an INT array instead of 500 columns
to good effect; it works slightly better with PostgreSQL's compression.
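
A sketch of that layout (the names are illustrative, not the poster's
schema; element i of the array stands in for column s_i):

-- one int[] column replaces s1..s512
CREATE TABLE cachetable_arr (
  b text PRIMARY KEY,
  s integer[] NOT NULL  -- 512 elements per row
);

-- the join then ships one array value instead of 512 columns
SELECT c.s, w.g
FROM worktable w
JOIN cachetable_arr c USING (b)
ORDER BY w.g;

-- individual values stay addressable: s[17] is the old s17
SELECT s[17] FROM cachetable_arr WHERE b = 'some-key';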

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance