subject:"\[PERFORM\] Performant queries on table with many boolean columns"

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-25 Thread Adam Brusselback

At that point would it be better to just use a boolean array?

Here is an example I just wrote up that does pretty damn fast searches.


SET work_mem = '256 MB';

CREATE TABLE test_bool AS
SELECT id, array_agg(random() < 0.85) as boolean_column
FROM generate_series(1, 100)
CROSS JOIN generate_series(1, 50) id
GROUP BY id;

CREATE INDEX idx_test_bool ON test_bool (boolean_column);

VACUUM ANALYZE test_bool;

SELECT *
FROM test_bool
ORDER BY  random()
LIMIT 10

SELECT id
FROM test_bool
WHERE boolean_column =
'{t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,f,t,t,t,t,f,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,f,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,f,f,t,t,t,t,t,t,t,t,t,f,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,t,f}'

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-25 Thread Merlin Moncure

On Sun, Apr 24, 2016 at 3:14 PM, bricklen  wrote:
> Query plan for the md5() index test:
>
>  Index Scan using lots_of_columns_md5_idx on lots_of_columns
> (cost=0.93..3.94 rows=1 width=208) (actual time=0.043..0.043 rows=1 loops=1)
>Index Cond: ('1ba23a0668ec17e230d98c270d6664dc'::text =
> md5(c1)::text
> || (c2)::text) || (c3)::text) || (c4)::text) || (c5)::text) || (c6)::text)
> || (c7)::text) || (c8)::text) || (c9)::text) || (c10)::text) || (c11)::text)
> || (c12)::text) || (c13)::text) || (c14)::text) || (c15)::text) ||
> (c16)::text) || (c17)::text) || (c18)::text) || (c19)::text) || (c20)::text)
> || (c21)::text) || (c22)::text) || (c23)::text) || (c24)::text) ||
> (c25)::text) || (c26)::text) || (c27)::text) || (c28)::text) || (c29)::text)
> || (c30)::text) || (c31)::text) || (c32)::text) || (c33)::text) ||
> (c34)::text) || (c35)::text) || (c36)::text) || (c37)::text) || (c38)::text)
> || (c39)::text) || (c40)::text) || (c41)::text) || (c42)::text) ||
> (c43)::text) || (c44)::text) || (c45)::text) || (c46)::text) || (c47)::text)
> || (c48)::text) || (c49)::text) || (c50)::text) || (c51)::text) ||
> (c52)::text) || (c53)::text) || (c54)::text) || (c55)::text) || (c56)::text)
> || (c57)::text) || (c58)::text) || (c59)::text) || (c60)::text) ||
> (c61)::text) || (c62)::text) || (c63)::text) || (c64)::text) || (c65)::text)
> || (c66)::text) || (c67)::text) || (c68)::text) || (c69)::text) ||
> (c70)::text) || (c71)::text) || (c72)::text) || (c73)::text) || (c74)::text)
> || (c75)::text) || (c76)::text) || (c77)::text) || (c78)::text) ||
> (c79)::text) || (c80)::text) || (c81)::text) || (c82)::text) || (c83)::text)
> || (c84)::text) || (c85)::text) || (c86)::text) || (c87)::text) ||
> (c88)::text) || (c89)::text) || (c90)::text) || (c91)::text) || (c92)::text)
> || (c93)::text) || (c94)::text) || (c95)::text) || (c96)::text) ||
> (c97)::text) || (c98)::text) || (c99)::text) || (c100)::text)))
>Buffers: shared hit=4
>  Planning time: 0.389 ms
>  Execution time: 0.129 ms
> (5 rows)

Hm.  Maybe use VARBIT? (assuming there are no null values or null can
be treated as false).

CREATE OR REPLACE FUNCTION MakeVarBit(VARIADIC BOOL[]) RETURNS VARBIT AS
$$
  SELECT string_agg(CASE WHEN v THEN '1' ELSE '0' END, '')::VARBIT
  FROM
  (
SELECT UNNEST($1) v
  ) q;
$$ LANGUAGE SQL IMMUTABLE;

postgres=# select MakeVarBit(true, true, false);
 makevarbit

 110

create index on lots_of_columns (MakeVarBit(c1, c2, c3, c4 ...));

merlin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-24 Thread bricklen

Query plan for the md5() index test:

 Index Scan using lots_of_columns_md5_idx on lots_of_columns
(cost=0.93..3.94 rows=1 width=208) (actual time=0.043..0.043 rows=1 loops=1)
   Index Cond: ('1ba23a0668ec17e230d98c270d6664dc'::text =
md5(c1)::text
|| (c2)::text) || (c3)::text) || (c4)::text) || (c5)::text) || (c6)::text)
|| (c7)::text) || (c8)::text) || (c9)::text) || (c10)::text) ||
(c11)::text) || (c12)::text) || (c13)::text) || (c14)::text) ||
(c15)::text) || (c16)::text) || (c17)::text) || (c18)::text) ||
(c19)::text) || (c20)::text) || (c21)::text) || (c22)::text) ||
(c23)::text) || (c24)::text) || (c25)::text) || (c26)::text) ||
(c27)::text) || (c28)::text) || (c29)::text) || (c30)::text) ||
(c31)::text) || (c32)::text) || (c33)::text) || (c34)::text) ||
(c35)::text) || (c36)::text) || (c37)::text) || (c38)::text) ||
(c39)::text) || (c40)::text) || (c41)::text) || (c42)::text) ||
(c43)::text) || (c44)::text) || (c45)::text) || (c46)::text) ||
(c47)::text) || (c48)::text) || (c49)::text) || (c50)::text) ||
(c51)::text) || (c52)::text) || (c53)::text) || (c54)::text) ||
(c55)::text) || (c56)::text) || (c57)::text) || (c58)::text) ||
(c59)::text) || (c60)::text) || (c61)::text) || (c62)::text) ||
(c63)::text) || (c64)::text) || (c65)::text) || (c66)::text) ||
(c67)::text) || (c68)::text) || (c69)::text) || (c70)::text) ||
(c71)::text) || (c72)::text) || (c73)::text) || (c74)::text) ||
(c75)::text) || (c76)::text) || (c77)::text) || (c78)::text) ||
(c79)::text) || (c80)::text) || (c81)::text) || (c82)::text) ||
(c83)::text) || (c84)::text) || (c85)::text) || (c86)::text) ||
(c87)::text) || (c88)::text) || (c89)::text) || (c90)::text) ||
(c91)::text) || (c92)::text) || (c93)::text) || (c94)::text) ||
(c95)::text) || (c96)::text) || (c97)::text) || (c98)::text) ||
(c99)::text) || (c100)::text)))
   Buffers: shared hit=4
 Planning time: 0.389 ms
 Execution time: 0.129 ms
(5 rows)

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-24 Thread bricklen

On Fri, Apr 22, 2016 at 6:57 AM, Rob Imig  wrote:

> Just to followup where I'm at, I've constructed a new column which is a
> 100 bit bitstring representing all the flags. Created a b-tree index on
> that column and can now do super fast lookups (2) for specific scenarios
> however getting the behavior I need would require a huge amount of OR
> conditions (as Rick mentioned earlier). Another option is to do bitwiser
> operators (3) but that seems really slow. Not sure how I can speed that up.
>

I tried a slightly different tact - how about creating a function-based
md5() index over your columns and doing the same for you input values? For
the test I ran, I used a char datatype with two possible values: '1' (true)
and '0' (false).
The columns were named (for simplicity), c1 to c100.

eg.
create index lots_of_columns_md5_idx on lots_of_columns (
md5(c1||c2||c3||c4||c5||c6||c7||c8||c9||c10||
c11||c12||c13||c14||c15||c16||c17||c18||c19||c20||
c21||c22||c23||c24||c25||c26||c27||c28||c29||c30||
c31||c32||c33||c34||c35||c36||c37||c38||c39||c40||
c41||c42||c43||c44||c45||c46||c47||c48||c49||c50||
c51||c52||c53||c54||c55||c56||c57||c58||c59||c60||
c61||c62||c63||c64||c65||c66||c67||c68||c69||c70||
c71||c72||c73||c74||c75||c76||c77||c78||c79||c80||
c81||c82||c83||c84||c85||c86||c87||c88||c89||c90||
c91||c92||c93||c94||c95||c96||c97||c98||c99||c100)
) with (fillfactor=100);

The query then looked like:
select ...
from ...
where md5(all||the||columns) = md5(all||your||values);

The test data I fabricated wasn't necessarily 85% true as you expect your
data to be, but the tests I ran were returning results in single-digit
milliseconds for a 1M row table. The queries become a bit more difficult to
create as you need to concatenate all the values together. You could pass
the list of columns into a function to abstract that away from the query,
but that might mess with the planner.
Note that the method suggested here relies on column ordering always being
the same, otherwise the hash will be different/inaccurate.

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-22 Thread Rob Imig

Just to followup where I'm at, I've constructed a new column which is a 100
bit bitstring representing all the flags. Created a b-tree index on that
column and can now do super fast lookups (2) for specific scenarios however
getting the behavior I need would require a huge amount of OR conditions
(as Rick mentioned earlier). Another option is to do bitwiser operators (3)
but that seems really slow. Not sure how I can speed that up.

For my specific use-case I think we are going to be able to shard by a
category so performance will be acceptable, so this is turning into an
educational exercise.

*1. SELECT..WHERE on each boolean property*

rimig=# explain analyze select bitstr from bloomtest_bi where prop0 AND
prop1 AND prop2 AND prop3 AND prop4 AND prop5 AND prop6 AND prop7 AND prop8
AND prop9 AND prop10 AND prop11 AND prop12 AND prop13 AND prop14 AND prop15
AND prop16 AND prop17 AND prop18 AND prop19 AND prop20 AND prop21 AND
prop22 AND prop23 AND prop24 AND prop25 AND prop26 AND prop27 AND prop28
AND prop29 AND prop30 AND prop31 AND prop32 AND prop33 AND prop34 AND
prop35 AND prop36 AND prop37 AND prop38 AND prop39 AND prop40 AND prop41
AND prop42 AND prop43 AND prop44 AND prop45 AND prop46 AND prop47 AND
prop48 AND prop49 AND prop50 AND prop51 AND prop52 AND prop53 AND prop54
AND prop55 AND prop56 AND prop57 AND prop58 AND prop59 AND prop60 AND
prop61 AND prop62 AND prop63 AND prop64;





QUERY PLAN






--

 Seq Scan on bloomtest_bi  (cost=0.00..350770.00 rows=6 width=18) (actual
time=229.365..2576.391 rows=9 loops=1)

   Filter: (prop0 AND prop1 AND prop2 AND prop3 AND prop4 AND prop5 AND
prop6 AND prop7 AND prop8 AND prop9 AND prop10 AND prop11 AND prop12 AND
prop13 AND prop14 AND prop15 AND prop16 AND prop17 AND prop18 AND prop19
AND prop20 AND prop21 AND prop22 AND prop23 AND prop24 AND prop25 AND
prop26 AND prop27 AND prop28 AND prop29 AND prop30 AND prop31 AND prop32
AND prop33 AND prop34 AND prop35 AND prop36 AND prop37 AND prop38 AND
prop39 AND prop40 AND prop41 AND prop42 AND prop43 AND prop44 AND prop45
AND prop46 AND prop47 AND prop48 AND prop49 AND prop50 AND prop51 AND
prop52 AND prop53 AND prop54 AND prop55 AND prop56 AND prop57 AND prop58
AND prop59 AND prop60 AND prop61 AND prop62 AND prop63 AND prop64)

   Rows Removed by Filter: 1191

 Total runtime: 2576.420 ms

(4 rows)


*Time: 2577.160 ms*


*2. SELECT..WHERE on exact bitstring match* (standard b-tree index on
bitstr so obviously fast here)

This would mean I'd have to OR all the conditions which is a bit gnarly.


rimig=# explain analyze select bitstr from bloomtest_bi where bitstr =
'101101110111001100110001101000111';

   QUERY
PLAN



 Index Only Scan using i_gist on bloomtest_bi  (cost=0.56..8.58 rows=1
width=18) (actual time=0.040..0.040 rows=1 loops=1)

   Index Cond: (bitstr =
B'101101110111001100110001101000111'::bit
varying)

   Heap Fetches: 1

 Total runtime: 0.056 ms

(4 rows)


*Time: 0.443 ms*


*3. SELECT..WHERE using bitwise operator*

This gets all the results I need however it's slow.

rimig=# explain analyze select bitstr from bloomtest_bi where (bitstr &
'1'
) =
'1';


QUERY PLAN



---

 Seq Scan on bloomtest_bi  (cost=0.00..410770.00 rows=6 width=18)
(actual time=856.595..9359.566 rows=9 loops=1)

   Filter: (((bitstr)::"bit" &
B'1

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-21 Thread Rob Imig

Hey all,

Lots of interesting suggestions! I'm loving it.

Just came back to this a bit earlier today and made a sample table to see
what non-index performance would be. Constructed data just like above (used
12M rows and 80% true for all 100 boolean columns)

Here's an analyze for what I'd expect to be the types of queries that I'll
be handling from the frontend. I would expect around 40-70 properties per
query.

Now I'm going to start experimenting with some ideas above and other
tuning. This isn't as bad as I thought it would be, though would like to
get this under 200ms.

rimig=# explain analyze select count(*) from bloomtest where prop0 AND
prop1 AND prop2 AND prop3 AND prop4 AND prop5 AND prop6 AND prop7 AND prop8
AND prop9 AND prop10 AND prop11 AND prop12 AND prop13 AND prop14 AND prop15
AND prop16 AND prop17 AND prop18 AND prop19 AND prop20 AND prop21 AND
prop22 AND prop23 AND prop24 AND prop25 AND prop26 AND prop27 AND prop28
AND prop29 AND prop30 AND prop31 AND prop32 AND prop33 AND prop34 AND
prop35 AND prop36 AND prop37 AND prop38 AND prop39 AND prop40 AND prop41
AND prop42 AND prop43 AND prop44 AND prop45 AND prop46 AND prop47 AND
prop48 AND prop49 AND prop50 AND prop51 AND prop52 AND prop53 AND prop54
AND prop55 AND prop56 AND prop57 AND prop58 AND prop59 AND prop60 AND
prop61 AND prop62 AND prop63 AND prop64;

 Aggregate  (cost=351563.03..351563.04 rows=1 width=0) (actual
time=2636.829..2636.829 rows=1 loops=1)

   ->  Seq Scan on bloomtest  (cost=0.00..351563.02 rows=3 width=0) (actual
time=448.200..2636.811 rows=9 loops=1)

 Filter: (prop0 AND prop1 AND prop2 AND prop3 AND prop4 AND prop5
AND prop6 AND prop7 AND prop8 AND prop9 AND prop10 AND prop11 AND prop12
AND prop13 AND prop14 AND prop15 AND prop16 AND prop17 AND prop18 AND
prop19 AND prop20 AND prop21 AND prop22 AND prop23 AND prop24 AND prop25
AND prop26 AND prop27 AND prop28 AND prop29 AND prop30 AND prop31 AND
prop32 AND prop33 AND prop34 AND prop35 AND prop36 AND prop37 AND prop38
AND prop39 AND prop40 AND prop41 AND prop42 AND prop43 AND prop44 AND
prop45 AND prop46 AND prop47 AND prop48 AND prop49 AND prop50 AND prop51
AND prop52 AND prop53 AND prop54 AND prop55 AND prop56 AND prop57 AND
prop58 AND prop59 AND prop60 AND prop61 AND prop62 AND prop63 AND prop64)

 Rows Removed by Filter: 1191

 Total runtime: 2636.874 ms

On Thu, Apr 21, 2016 at 12:45 PM, Jeff Janes  wrote:

> On Wed, Apr 20, 2016 at 11:54 AM, Teodor Sigaev  wrote:
> >>
> >> The obvious thing seems to make a table with ~100 columns, with 1 column
> >> for each boolean property. Though, what type of indexing strategy would
> >> one use on that table? Doesn't make sense to do BTREE. Is there a better
> >> way to structure it?
> >>
> > looks like a deal for contrib/bloom index in upcoming 9.6 release
>
> Not without doing a custom compilation with an increased INDEX_MAX_KEYS:
>
> ERROR:  cannot use more than 32 columns in an index
>
> But even so, I'm skeptical this would do better than a full scan.  It
> would be interesting to test that.
>
> Cheers,
>
> Jeff
>

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-21 Thread Jeff Janes

On Wed, Apr 20, 2016 at 11:54 AM, Teodor Sigaev  wrote:
>>
>> The obvious thing seems to make a table with ~100 columns, with 1 column
>> for each boolean property. Though, what type of indexing strategy would
>> one use on that table? Doesn't make sense to do BTREE. Is there a better
>> way to structure it?
>>
> looks like a deal for contrib/bloom index in upcoming 9.6 release

Not without doing a custom compilation with an increased INDEX_MAX_KEYS:

ERROR:  cannot use more than 32 columns in an index

But even so, I'm skeptical this would do better than a full scan.  It
would be interesting to test that.

Cheers,

Jeff


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-21 Thread Jeff Janes

On Wed, Apr 20, 2016 at 11:41 AM, Rob Imig  wrote:
> Hey all,
>
> New to the lists so please let me know if this isn't the right place for
> this question.
>
> I am trying to understand how to structure a table to allow for optimal
> performance on retrieval. The data will not change frequently so you can
> basically think of it as static and only concerned about optimizing reads
> from basic SELECT...WHERE queries.
>
> The data:
>
> ~20 million records
> Each record has 1 id and ~100 boolean properties
> Each boolean property has ~85% of the records as true
>
>
> The retrieval will always be something like "SELECT id FROM  WHERE
> .
>
>  will be some arbitrary set of the ~100 boolean columns and you
> want the ids that match all of the conditions (true for each boolean
> column). Example:
> WHERE prop1 AND prop18 AND prop24

Is 3 a typical number of conditions to have?

85%^3 is 61.4%, so you are fetching most of the table.  At that point,
I think I would give up on indexes and just expect to do a full table
scan each time.   Which means a single column
bit-string data type might be the way to go, although the construction
of the queries would then be more cumbersome, especially if you will
do by hand.

I think the only way to know for sure is to write a few scripts to benchmark it.

Cheers,

Jeff

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-20 Thread Rick Otten

Would a bit string column work? --
http://www.postgresql.org/docs/9.5/static/datatype-bit.html

You might need to use a lot of bitwise OR statements in the query though if
you are looking at very sparse sets of specific values...

Something like the get_bit() function might allow you to select a specific
bit, but then you might want a bunch of functional indexes on the column
for various get_bit() combinations.

Maybe you can group commonly queried sets of columns into bit strings.
 (rather than having one bit string column for all 100 booleans).

On Wed, Apr 20, 2016 at 2:54 PM, Teodor Sigaev  wrote:

>
>> The obvious thing seems to make a table with ~100 columns, with 1 column
>> for each boolean property. Though, what type of indexing strategy would
>> one use on that table? Doesn't make sense to do BTREE. Is there a better
>> way to structure it?
>>
>> looks like a deal for contrib/bloom index in upcoming 9.6 release
>
>
> --
> Teodor Sigaev  E-mail: teo...@sigaev.ru
>   WWW: http://www.sigaev.ru/
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

Re: [PERFORM] Performant queries on table with many boolean columns

2016-04-20 Thread Teodor Sigaev



The obvious thing seems to make a table with ~100 columns, with 1 column
for each boolean property. Though, what type of indexing strategy would
one use on that table? Doesn't make sense to do BTREE. Is there a better
way to structure it?


looks like a deal for contrib/bloom index in upcoming 9.6 release


--
Teodor Sigaev  E-mail: teo...@sigaev.ru
  WWW: http://www.sigaev.ru/


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[PERFORM] Performant queries on table with many boolean columns

2016-04-20 Thread Rob Imig

Hey all,

New to the lists so please let me know if this isn't the right place for
this question.

I am trying to understand how to structure a table to allow for optimal
performance on retrieval. The data will not change frequently so you can
basically think of it as static and only concerned about optimizing reads
from basic SELECT...WHERE queries.

The data:

   - ~20 million records
   - Each record has 1 id and ~100 boolean properties
   - Each boolean property has ~85% of the records as true


The retrieval will always be something like "SELECT id FROM  WHERE
.

 will be some arbitrary set of the ~100 boolean columns and you
want the ids that match all of the conditions (true for each boolean
column). Example:
WHERE prop1 AND prop18 AND prop24


The obvious thing seems to make a table with ~100 columns, with 1 column
for each boolean property. Though, what type of indexing strategy would one
use on that table? Doesn't make sense to do BTREE. Is there a better way to
structure it?


Any and all advice/tips/questions appreciated!

Thanks,
Rob

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

Re: [PERFORM] Performant queries on table with many boolean columns

[PERFORM] Performant queries on table with many boolean columns

11 matches

Site Navigation

Mail list logo

Footer information