Re: [HACKERS] Question about sorting internals

2013-12-11 Thread Tom Lane
hubert depesz lubaczewski  writes:
> There are two simple queries: ...
> They differ only in order of queries in union all part.
> The thing is that they return the same result. Why isn't one of them returning
> "2005" for 6th "miesiac"?

With such a small amount of data, you're getting an in-memory quicksort,
and a well-known property of quicksort is that it isn't stable --- that
is, there are no guarantees about the order in which it will return items
that have equal keys.  In this case it's evidently making different
partitioning choices, as a consequence of the different arrival order of
the rows, that just by chance end up with the 6/2004/6 row being returned
before the 6/2005/6 row in both cases.  You could trace through the logic
and see exactly how that's happening, but I doubt it'd be a very edifying
exercise.

If you want to get well-defined results with DISTINCT ON, you should
make the ORDER BY sort by a candidate key.  Anything less opens you to
uncertainty about which rows the DISTINCT will select.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about sorting internals

2013-12-11 Thread Robert Haas
On Wed, Dec 11, 2013 at 4:56 AM, hubert depesz lubaczewski
 wrote:
> before I'll go any further - this is only thought-experiment. I do not
> plan to use such queries in real-life applications. I was just presented
> with a question that I can't answer in any logical way.
>
> There are two simple queries:
>
> #v+
> with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)  
> ,(4,4),(5,NULL),(6,6))
> ,rok2004 (miesiac,wynik) as (VALUES (1,3)  ,(3,3),(4,5) 
> ,(6,6))
> SELECT
> distinct on (miesiac) *
> FROM (
> SELECT miesiac, 2005 as rok, wynik FROM rok2005
> union all
> SELECT miesiac, 2004 as rok, wynik FROM rok2004
> ) as polaczone
> ORDER BY miesiac, wynik desc;
> #v-
>
> #v+
> with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)  
> ,(4,4),(5,NULL),(6,6))
> ,rok2004 (miesiac,wynik) as (VALUES (1,3)  ,(3,3),(4,5) 
> ,(6,6))
> SELECT
> distinct on (miesiac) *
> FROM (
> SELECT miesiac, 2004 as rok, wynik FROM rok2004
> union all
> SELECT miesiac, 2005 as rok, wynik FROM rok2005
> ) as polaczone
> ORDER BY miesiac, wynik desc;
> #v-
>
> They differ only in order of queries in union all part.
>
> The thing is that they return the same result. Why isn't one of them returning
> "2005" for 6th "miesiac"?

The query planner sees that in order for the output ordering to match
the ORDER BY clause, it's got to sort by miesiac, wynik desc.  The
DISTINCT ON clause can be implemented very cheaply after that - every
time the value of miesiac changes, it emits only the first of the rows
with the new value.  So it's a good plan.  However, because the sort
happens before the unique step, the results you get are dependent on
what order the sort happens to emit the rows.  Our sort algorithms are
not stable, so there's no particular guarantee about the order in
which rows will pop out, beyond the fact that they must obey the sort
key.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about sorting internals

2013-12-11 Thread hubert depesz lubaczewski
On Wed, Dec 11, 2013 at 03:34:38PM +0530, Ashutosh Bapat wrote:
> Hi deepesz,
> You might want to see their EXPLAIN VERBOSE outputs. Having one of them
> (2004 one) lesser number of rows, might be getting picked up as first
> relation being union and thus ends up having it's rows before the second
> one. Explain output would make it more clear. Also, try having same number
> of rows in both the relations.

Explains:

 QUERY PLAN

 Unique  (cost=0.44..0.48 rows=9 width=12) (actual time=0.030..0.035 rows=6 
loops=1)
   Output: rok2004.miesiac, (2004), rok2004.wynik
   CTE rok2004
 ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=8) (actual 
time=0.001..0.003 rows=5 loops=1)
   Output: "*VALUES*".column1, "*VALUES*".column2
   CTE rok2005
 ->  Values Scan on "*VALUES*_1"  (cost=0.00..0.05 rows=4 width=8) (actual 
time=0.000..0.001 rows=4 loops=1)
   Output: "*VALUES*_1".column1, "*VALUES*_1".column2
   ->  Sort  (cost=0.32..0.35 rows=9 width=12) (actual time=0.029..0.031 rows=9 
loops=1)
 Output: rok2004.miesiac, (2004), rok2004.wynik
 Sort Key: rok2004.miesiac, rok2004.wynik
 Sort Method: quicksort  Memory: 25kB
 ->  Append  (cost=0.00..0.18 rows=9 width=12) (actual 
time=0.007..0.018 rows=9 loops=1)
   ->  CTE Scan on rok2004  (cost=0.00..0.10 rows=5 width=12) 
(actual time=0.006..0.011 rows=5 loops=1)
 Output: rok2004.miesiac, 2004, rok2004.wynik
   ->  CTE Scan on rok2005  (cost=0.00..0.08 rows=4 width=12) 
(actual time=0.002..0.004 rows=4 loops=1)
 Output: rok2005.miesiac, 2005, rok2005.wynik
 Total runtime: 0.077 ms
(18 rows)

 QUERY PLAN

 Unique  (cost=0.44..0.48 rows=9 width=12) (actual time=0.024..0.027 rows=6 
loops=1)
   Output: rok2005.miesiac, (2005), rok2005.wynik
   CTE rok2004
 ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=8) (actual 
time=0.001..0.003 rows=5 loops=1)
   Output: "*VALUES*".column1, "*VALUES*".column2
   CTE rok2005
 ->  Values Scan on "*VALUES*_1"  (cost=0.00..0.05 rows=4 width=8) (actual 
time=0.001..0.003 rows=4 loops=1)
   Output: "*VALUES*_1".column1, "*VALUES*_1".column2
   ->  Sort  (cost=0.32..0.35 rows=9 width=12) (actual time=0.023..0.024 rows=9 
loops=1)
 Output: rok2005.miesiac, (2005), rok2005.wynik
 Sort Key: rok2005.miesiac, rok2005.wynik
 Sort Method: quicksort  Memory: 25kB
 ->  Append  (cost=0.00..0.18 rows=9 width=12) (actual 
time=0.004..0.015 rows=9 loops=1)
   ->  CTE Scan on rok2005  (cost=0.00..0.08 rows=4 width=12) 
(actual time=0.003..0.006 rows=4 loops=1)
 Output: rok2005.miesiac, 2005, rok2005.wynik
   ->  CTE Scan on rok2004  (cost=0.00..0.10 rows=5 width=12) 
(actual time=0.001..0.006 rows=5 loops=1)
 Output: rok2004.miesiac, 2004, rok2004.wynik
 Total runtime: 0.053 ms
(18 rows)

So, it looks like rowcount is the one thing that's different. Not
entirely sure how the logic would be to make rowcount differ.

After some more talk on #postgresql, it looks like I will have to spend
some time with debugger to see what's happening there.

Best regards,

depesz



signature.asc
Description: Digital signature


Re: [HACKERS] Question about sorting internals

2013-12-11 Thread Ashutosh Bapat
Hi deepesz,
You might want to see their EXPLAIN VERBOSE outputs. Having one of them
(2004 one) lesser number of rows, might be getting picked up as first
relation being union and thus ends up having it's rows before the second
one. Explain output would make it more clear. Also, try having same number
of rows in both the relations.


On Wed, Dec 11, 2013 at 3:26 PM, hubert depesz lubaczewski <
dep...@depesz.com> wrote:

> Hi,
>
> before I'll go any further - this is only thought-experiment. I do not
> plan to use such queries in real-life applications. I was just presented
> with a question that I can't answer in any logical way.
>
> There are two simple queries:
>
> #v+
> with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)
>  ,(4,4),(5,NULL),(6,6))
> ,rok2004 (miesiac,wynik) as (VALUES (1,3)  ,(3,3),(4,5)
> ,(6,6))
> SELECT
> distinct on (miesiac) *
> FROM (
> SELECT miesiac, 2005 as rok, wynik FROM rok2005
> union all
> SELECT miesiac, 2004 as rok, wynik FROM rok2004
> ) as polaczone
> ORDER BY miesiac, wynik desc;
> #v-
>
> #v+
> with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)
>  ,(4,4),(5,NULL),(6,6))
> ,rok2004 (miesiac,wynik) as (VALUES (1,3)  ,(3,3),(4,5)
> ,(6,6))
> SELECT
> distinct on (miesiac) *
> FROM (
> SELECT miesiac, 2004 as rok, wynik FROM rok2004
> union all
> SELECT miesiac, 2005 as rok, wynik FROM rok2005
> ) as polaczone
> ORDER BY miesiac, wynik desc;
> #v-
>
> They differ only in order of queries in union all part.
>
> The thing is that they return the same result. Why isn't one of them
> returning
> "2005" for 6th "miesiac"?
>
> I know I'm not sorting using "rok", which means I'm getting "undefined
> functionality". Fine. But what exactly is happening that regardless of
> order of rows in subquery, I get the same, always lower, rok in output?
>
> Best regards,
>
> depesz
>
> --
> The best thing about modern society is how easy it is to avoid contact
> with it.
>
> http://depesz.com/
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company