[PERFORM] How to tell ANALYZE to collect statistics from the whole table?

2015-01-24 Thread AlexK987
The documentation states that "The extent of analysis can be controlled by
adjusting the default_statistics_target configuration variable". It looks
like I can tell Postgres to create more histograms with more bins, and more
distinct values. This implicitly means that Postgres will use a larger
random subset to calculate statistics. 

However, this is not what I want. My data may be quite skewed, and I want
full control over the size of the sample. I want to explicitly tell Postgres
to analyze the whole table. How can I accomplish that?





Re: [PERFORM] How to tell ANALYZE to collect statistics from the whole table?

2015-01-24 Thread Tomas Vondra
Hi,

On 25.1.2015 00:33, AlexK987 wrote:
> The documentation states that "The extent of analysis can be 
> controlled by adjusting the default_statistics_target configuration 
> variable". It looks like I can tell Postgres to create more 
> histograms with more bins, and more distinct values. This implicitly
> means that Postgres will use a larger random subset to calculate
> statistics.
> 
> However, this is not what I want. My data may be quite skewed, and I 
> want full control over the size of the sample. I want to explicitly 
> tell Postgres to analyze the whole table. How can I accomplish that?

I don't think there's an official way to do that - at least I can't
think of one. The only thing you can do is increase the statistics
target (either globally by setting default_statistics_target, or per
column using ALTER TABLE ... SET STATISTICS).
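
For illustration, a minimal sketch of both options (the table and column
names here are hypothetical):

-- raise the target for one skewed column only, then rebuild its stats
ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 1000;
ANALYZE my_table;

-- or raise the session-wide default that ANALYZE will use
SET default_statistics_target = 1000;
ANALYZE my_table;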

As you noticed, this however controls two things - the sample size and
how detailed the statistics (MCV list / histogram) will be. The
statistics target is used as an upper bound on the number of MCV items /
histogram bins, and the number of sampled rows is (300 * statistics_target).
With default_statistics_target = 10000 (which is the maximum allowed value
since 9.0), this produces very detailed stats and uses a sample of ~3M rows.

It's a bit more complicated though, because there's an algorithm that
decides how many MCV items / histogram buckets to actually create, based
on the data. So you may not get more detailed stats, even when using a
larger sample.

That being said, I really doubt increasing the statistics target above
10000 (or even sampling the whole table) will help you in practice.
Might be worth showing an example of a bad estimate with your data, or
maybe a test case to play with.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] How to tell ANALYZE to collect statistics from the whole table?

2015-01-24 Thread AlexK987
Tomas,

Thank you for a very useful reply. Right now I do not have a case of poor
performance caused by strong data skew which is not properly reflected in
statistics. I was being defensive, trying to prevent every possible thing
that might go wrong.





Re: [PERFORM] How to tell ANALYZE to collect statistics from the whole table?

2015-01-24 Thread Tomas Vondra
On 25.1.2015 02:04, AlexK987 wrote:
> Tomas,
> 
> Thank you for a very useful reply. Right now I do not have a case of
> poor performance caused by strong data skew which is not properly
> reflected in statistics. I was being defensive, trying to prevent
> every possible thing that might go wrong.

OK. My recommendation is not to mess with default_statistics_target
unless you actually have to (e.g. increasing the value on all tables
without a query where the current value causes trouble). It increases
the time needed to plan queries, collect statistics (ANALYZE / autovacuum), etc.
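
If it helps, a sketch for spotting columns that already carry a per-column
override, so the global default can stay untouched (in pg_attribute,
attstattarget = -1 means the column simply follows default_statistics_target):

SELECT attrelid::regclass AS table_name, attname AS column_name, attstattarget
FROM pg_attribute
WHERE attstattarget > 0      -- only columns with an explicit SET STATISTICS
  AND attnum > 0             -- skip system columns
  AND NOT attisdropped;      -- skip dropped columns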

regards
-- 
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PERFORM] Initial insert

2015-01-24 Thread Tomas Vondra
Hi,

On 22.1.2015 17:46, Laurent Cathala wrote:
> Hi, 
> I'm trying to load data in an initial import and I'm encountering a
> performance issue.
> I've 2 tables; my process creates a record in each table and executes a
> sum with a join on these 2 tables (the other queries are very fast).
> 
> My 2 tables are empty before the import.
> 
> My count query is :
> select sum(quantitest0_.quantite_valeur) as col_0_0_ from
> dm5_quantitestock quantitest0_, dm5_caracteristiquearticlestock
> caracteris1_ where
> quantitest0_.id_caracteristiquearticlestock=caracteris1_.id and
> caracteris1_.id_article='4028804c4a311178014a346546967c59'
> 
> I use parameterized queries.
> 
> My process creates only 6000 records in each table.
> 
> During the whole process this sum query takes longer and longer.
> 
> The auto-explain plan shows a seq scan 
> 
> --
>   Query Text: select sum(quantitest0_.quantite_valeur) as col_0_0_ from
> dm5_quantitestock quantitest0_, dm5_caracteristiquearticlestock
> caracteris1_ where
> quantitest0_.id_caracteristiquearticlestock=caracteris1_.id and
> caracteris1_.id_article=$1
> Aggregate  (cost=2.04..2.05 rows=1 width=26) (actual
> time=862.621..862.621 rows=1 loops=1)
>   Output: sum(quantitest0_.quantite_valeur)
>   ->  Nested Loop  (cost=0.00..2.04 rows=1 width=26) (actual
> time=862.618..862.618 rows=0 loops=1)
> Output: quantitest0_.quantite_valeur
> Join Filter:
> ((quantitest0_.id_caracteristiquearticlestock)::text =
> (caracteris1_.id)::text)
> Rows Removed by Join Filter: 1869
> ->  Seq Scan on public.dm5_quantitestock quantitest0_
>  (cost=0.00..1.01 rows=1 width=164) (actual time=0.004..0.408 rows=1869
> loops=1)
>   Output: quantitest0_.id,
> quantitest0_.datefinvalidite, quantitest0_.quantite_valeur,
> quantitest0_.id_caracteristiquearticlestock,
> quantitest0_.id_caracteristiquelieustock,
> quantitest0_.datecreationsysteme, quantitest0_.datemodificationsysteme,
> quantitest0_.id_creeparsysteme, quantitest0_.id_modifieparsysteme
> ->  Seq Scan on public.dm5_caracteristiquearticlestock
> caracteris1_  (cost=0.00..1.01 rows=1 width=42) (actual
> time=0.456..0.456 rows=1 loops=1869)
>   Output: caracteris1_.id,
> caracteris1_.datefinvalidite, caracteris1_.id_lot,
> caracteris1_.id_article, caracteris1_.id_numeroserie,
> caracteris1_.datecreationsysteme, caracteris1_.datemodificationsysteme,
> caracteris1_.id_modifieparsysteme, caracteris1_.id_creeparsysteme
>   Filter: ((caracteris1_.id_article)::text = ($1)::text)
>   Rows Removed by Filter: 1869
> ---
> 
> If I launch an ANALYZE during the process, the plan uses the indexes, but
> the time remains the same.
> 
> -
> Query Text: select sum(quantitest0_.quantite_valeur) as col_0_0_ from
> dm5_quantitestock quantitest0_, dm5_caracteristiquearticlestock
> caracteris1_ where
> quantitest0_.id_caracteristiquearticlestock=caracteris1_.id and
> caracteris1_.id_article=$1
> Aggregate  (cost=16.55..16.56 rows=1 width=26) (actual
> time=654.998..654.998 rows=1 loops=1)
>  Output: sum(quantitest0_.quantite_valeur)
>  ->  Nested Loop  (cost=0.00..16.55 rows=1 width=26) (actual
> time=654.994..654.994 rows=0 loops=1)
>Output: quantitest0_.quantite_valeur
>Join Filter: ((quantitest0_.id_caracteristiquearticlestock)::text
> = (caracteris1_.id)::text)
>Rows Removed by Join Filter: 1651
>->  Index Scan using x_dm5_quantitestock_00 on
> public.dm5_quantitestock quantitest0_  (cost=0.00..8.27 rows=1
> width=164) (actual time=0.011..0.579 rows=1651 loops=1)
>  Output: quantitest0_.id, quantitest0_.datefinvalidite,
> quantitest0_.quantite_valeur,
> quantitest0_.id_caracteristiquearticlestock,
> quantitest0_.id_caracteristiquelieustock,
> quantitest0_.datecreationsysteme, quantitest0_.datemodificationsysteme,
> quantitest0_.id_creeparsysteme, quantitest0_.id_modifieparsysteme
>->  Index Scan using dm5_caracteristiquearticlestock_pkey on
> public.dm5_caracteristiquearticlestock caracteris1_  (cost=0.00..8.27
> rows=1 width=42) (actual time=0.395..0.395 rows=1 loops=1651)
>  Output: caracteris1_.id, caracteris1_.datefinvalidite,
> caracteris1_.id_lot, caracteris1_.id_article,
> caracteris1_.id_numeroserie, caracteris1_.datecreationsysteme,
> caracteris1_.datemodificationsysteme, caracteris1_.id_modifieparsysteme,
> caracteris1_.id_creeparsysteme
>  Filter: ((caracteris1_.id_article)::text =
> '4028804c4a311178014a346547307cce'::text)
>  Rows Removed by Filter: 1651
> 
> --

Why is the first query using a parameter ($1) while the second one uses
a string literal? Have you executed them differently?

> 
> If I create the first 1000 records, commit and end the transaction, the 
> whole import is very fast.

And what plans do the queries use?

> 
> I can't cha

Re: [PERFORM] How to tell ANALYZE to collect statistics from the whole table?

2015-01-24 Thread Tom Lane
AlexK987  writes:
> The documentation states that "The extent of analysis can be controlled by
> adjusting the default_statistics_target configuration variable". It looks
> like I can tell Postgres to create more histograms with more bins, and more
> distinct values. This implicitly means that Postgres will use a larger
> random subset to calculate statistics. 

> However, this is not what I want. My data may be quite skewed, and I want
> full control over the size of the sample. I want to explicitly tell Postgres
> to analyze the whole table. How can I accomplish that?

You can't, and you wouldn't want to if you could, because that would
result in slurping the entire table into backend local memory.  All
the rows constituting the "random sample" are held in memory while
doing the statistical calculations.

In practice, the only stat that would be materially improved by taking
enormously large samples would be the number-of-distinct-values estimate.
There's already a way you can override ANALYZE's estimate of that number
if you need to.
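
For reference, a minimal sketch of that override (relation and column names
are hypothetical). The n_distinct attribute option makes subsequent ANALYZE
runs use the given value instead of estimating it; negative values are read
as a fraction of the row count:

-- assume we know there are roughly 500k distinct values in my_column
ALTER TABLE my_table ALTER COLUMN my_column SET (n_distinct = 500000);
-- or: SET (n_distinct = -0.1) to say "10% of the rows are distinct"
ANALYZE my_table;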

regards, tom lane




[PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
I have an events table that records page views and purchases (type =
'viewed' or type='purchased'). I have a query that figures out "people who
bought/viewed this also bought/viewed that".

It worked fine, taking about 0.1 seconds to complete, until a few hours ago
when it started taking hours to complete. Vacuum/analyze didn't help.
Turned out there was one session_id that had 400k rows in the system.
Deleting that made the query performant again.

Is there anything I can do to make the query work better in cases like
that? Missing index, or better query?

This is on 9.3.5.

The below is reproduced at the following URL if it's not formatted
correctly in the email.
https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt

explain  select
   e1.product_id,
   e2.site_id,
   e2.product_id,
   count(nullif(e2.type='viewed', false)) view_count,
   count(nullif(e2.type='purchased', false)) purchase_count
 from events e1
 join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
 where
   e1.product_id = '82503' and
   e1.product_id != e2.product_id
 group by e1.product_id, e2.product_id, e2.site_id;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
   ->  Sort  (cost=828395.67..840117.89 rows=465 width=19)
         Sort Key: e1.product_id, e2.product_id, e2.site_id
         ->  Nested Loop  (cost=11.85..20371.14 rows=465 width=19)
               ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
                     Recheck Cond: (product_id = '82503'::citext)
                     ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
                           Index Cond: (product_id = '82503'::citext)
               ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
                     Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
                     Filter: (e1.product_id <> product_id)
(11 rows)

recommender_production=> \d events
                                     Table "public.events"
   Column    |           Type           |                      Modifiers
-------------+--------------------------+------------------------------------------------------
 id          | bigint                   | not null default nextval('events_id_seq'::regclass)
 user_id     | citext                   |
 session_id  | citext                   | not null
 product_id  | citext                   | not null
 site_id     | citext                   | not null
 type        | text                     | not null
 happened_at | timestamp with time zone | not null
 created_at  | timestamp with time zone | not null
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_product_id_site_id_idx" btree (product_id, site_id)
    "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
Check constraints:
    "events_session_id_check" CHECK (length(session_id::text) < 255)
    "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
    "events_user_id_check" CHECK (length(user_id::text) < 255)


Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:

> I have an events table that records page views and purchases (type =
> 'viewed' or type='purchased'). I have a query that figures out "people who
> bought/viewed this also bought/viewed that".
> [...]
>
After removing the session with 400k events, I was able to do an explain
analyze, here is one of them:
http://explain.depesz.com/s/PFNk


Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
Oops, didn't run vacuum analyze after deleting the events. Here is another
'explain analyze': http://explain.depesz.com/s/AviN

On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:

> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>
>> I have an events table that records page views and purchases (type =
>> 'viewed' or type='purchased'). I have a query that figures out "people who
>> bought/viewed this also bought/viewed that".
>> [...]
>>
> After removing the session with 400k events, I was able to do an explain
> analyze, here is one of them:
> http://explain.depesz.com/s/PFNk
>


Re: [PERFORM] Query performance

2015-01-24 Thread Pavel Stehule
Hi

this plan looks well

Regards

Pavel

2015-01-25 6:45 GMT+01:00 Joe Van Dyk :

> Oops, didn't run vacuum analyze after deleting the events. Here is another
> 'explain analyze': http://explain.depesz.com/s/AviN
>
> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>
>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>>
>>> I have an events table that records page views and purchases (type =
>>> 'viewed' or type='purchased'). I have a query that figures out "people who
>>> bought/viewed this also bought/viewed that".
>>> [...]
>>>
>> After removing the session with 400k events, I was able to do an explain
>> analyze, here is one of them:
>> http://explain.depesz.com/s/PFNk
>>
>
>


Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule  wrote:

> Hi
>
> this plan looks well
>
> Regards
>
> Pavel
>

Here's one that's not quite as well: http://explain.depesz.com/s/SgT

Joe


>
> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk :
>
>> Oops, didn't run vacuum analyze after deleting the events. Here is
>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>
>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>>
>>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>>>
>>>> I have an events table that records page views and purchases (type =
>>>> 'viewed' or type='purchased'). I have a query that figures out "people who
>>>> bought/viewed this also bought/viewed that".
>>>> [...]
>>>>
>>> After removing the session with 400k events, I was able to do an explain
>>> analyze, here is one of them:
>>> http://explain.depesz.com/s/PFNk
>>>
>>
>>
>


Re: [PERFORM] Query performance

2015-01-24 Thread Pavel Stehule
2015-01-25 7:38 GMT+01:00 Joe Van Dyk :

>
>
> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule  wrote:
>
>> Hi
>>
>> this plan looks well
>>
>> Regards
>>
>> Pavel
>>
>
> Here's one that's not quite as well: http://explain.depesz.com/s/SgT
>

I see a possible issue:

(product_id <> '81716'::citext) .. this operation is CPU expensive and
maybe nonsense.

product_id should be an integer -- and if it isn't, this comparison will
not be extremely fast on 4M rows - mainly because of citext.

Try to force the opposite cast - you will save a case-insensitive text
comparison:

product_id::int <> 81716
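
For illustration, a sketch of the original query with that cast applied
(this assumes every product_id value really is numeric; the cast raises an
error otherwise):

explain select
   e1.product_id,
   e2.site_id,
   e2.product_id,
   count(nullif(e2.type='viewed', false)) view_count,
   count(nullif(e2.type='purchased', false)) purchase_count
 from events e1
 join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
 where
   e1.product_id = '82503' and
   e1.product_id::int <> e2.product_id::int   -- integer comparison instead of citext
 group by e1.product_id, e2.product_id, e2.site_id;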

Regards

Pavel




>
> Joe
>
>
>>
>> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk :
>>
>>> Oops, didn't run vacuum analyze after deleting the events. Here is
>>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>>
>>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>>>
>>>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>>>>
>>>>> I have an events table that records page views and purchases (type =
>>>>> 'viewed' or type='purchased'). I have a query that figures out "people who
>>>>> bought/viewed this also bought/viewed that".
>>>>> [...]
>>>>>
>>>> After removing the session with 400k events, I was able to do an
>>>> explain analyze, here is one of them:
>>>> http://explain.depesz.com/s/PFNk

>>>
>>>
>>
>


Re: [PERFORM] Query performance

2015-01-24 Thread Joe Van Dyk
On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule  wrote:

>
>
> 2015-01-25 7:38 GMT+01:00 Joe Van Dyk :
>
>>
>>
>> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule 
>> wrote:
>>
>>> Hi
>>>
>>> this plan looks well
>>>
>>> Regards
>>>
>>> Pavel
>>>
>>
>> Here's one that's not quite as well: http://explain.depesz.com/s/SgT
>>
>
> I see a possible issue:
>
> (product_id <> '81716'::citext) .. this operation is CPU expensive and
> maybe nonsense.
>
> product_id should be an integer -- and if it isn't, this comparison will
> not be extremely fast on 4M rows - mainly because of citext.
>
> Try to force the opposite cast - you will save a case-insensitive text
> comparison:
>
> product_id::int <> 81716
>

It might not always be an integer; it just happens to be so here. Should I
try text instead? I don't need the case-insensitive matching.

Joe


>
> Regards
>
> Pavel
>
>
>
>
>>
>> Joe
>>
>>
>>>
>>> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk :
>>>
>>>> Oops, didn't run vacuum analyze after deleting the events. Here is
>>>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>>>
>>>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk  wrote:
>>>>
>>>>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk  wrote:
>>>>>
>>>>>> I have an events table that records page views and purchases (type =
>>>>>> 'viewed' or type='purchased'). I have a query that figures out "people who
>>>>>> bought/viewed this also bought/viewed that".
>>>>>> [...]