Re: estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)

2017-07-07 Thread Peter Geoghegan
On Fri, Jul 7, 2017 at 4:41 PM, Justin Pryzby  wrote:
> The second change averages separate correlation values of small lists of 300
> consecutive TIDs, rather than the coarse-granularity/aggregate correlation of
> a small sample of pages across the entire index.  Postgres' existing sampling
> is designed to give an even sample across all rows.  An issue with this
> coarse-granularity correlation is that the index can have a broad correlation
> (to physical heap location), but with many small-scale deviations, which don't
> show up due to sampling a relatively small fraction of a large table; and/or
> the effect of the deviations is insignificant/noise and correlation is still
> computed near 1.
>
> I believe the "large scale" correlation that postgres computes from a block
> sample fails to represent small-scale uncorrelated reads, which contribute a
> large number of random reads that are not included in the planner's cost.

All of that may well be true, but I've actually come around to the
view that we should treat TID as a first class part of the keyspace,
that participates in comparisons as an implicit last column in all
cases (not just when B-Trees are built within tuplesort.c). That would
probably obviate the need for a patch like this entirely, because
pg_stats.correlation would be unaffected by duplicates. I think that
this would have a number of advantages, some of which might be
significant. For example, I think it could make LP_DEAD cleanup within
nbtree more effective for some workloads, especially workloads where
it is important for HOT pruning and LP_DEAD cleanup to work in concert
-- locality of access probably matters with that. Also, with every
entry in the index guaranteed to be unique, we can imagine VACUUM
being much more selective with killing index entries, when the TID
array it uses happens to be very small. With the freeze map stuff
that's now in place, being able to do that matters more than before.

The original Lehman & Yao algorithm is supposed to have unique keys in
all cases, but we don't follow that in order to make duplicates work,
which is implemented by changing an invariant (see nbtree README for
details). So, this could simplify some aspects of how binary searches
must work in the face of having to deal with non-unique keys. I think
that there are cases where many non-HOT UPDATEs have to go through a
bunch of duplicate leaf tuples and do visibility checking on old
versions for no real benefit. With the TID as part of the keyspace, we
could instead tell the B-Tree code to insert a new tuple as part of an
UPDATE while using the TID as part of its insertion scan key, so
rummaging through many duplicates is avoided.

That having been said, it would be hard to do this for all the reasons
I went into in that thread you referenced [1]. If you're going to
treat TID as a part of the keyspace, you have to totally embrace that,
which means that the internal pages need to have heap TIDs too (not
just pointers to the lower levels in the B-Tree, which is what the
IndexTuple's t_tid pointer is used for there). Those are the places
where you need to literally append this new, implicit heap TID column
as if it were just another user-visible attribute, since that
information isn't stored in the internal pages at all today. Doing
that has a cost, which isn't going to be acceptable if we naively
append a heap TID to every internal page IndexTuple. With a type like
int4, you're going to completely bloat those pages, with big
consequences for fan-in.
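
A back-of-envelope sketch of that bloat (the sizes here are assumptions
for illustration: 8-byte IndexTuple header, 6-byte heap TID, 4-byte line
pointer, MAXALIGN of 8, and roughly 8KB of usable space per page):

-- internal page entries before and after naively appending a heap TID
-- to an int4 key: MAXALIGN(8+4)=16 vs. MAXALIGN(8+4+6)=24, plus the
-- 4-byte line pointer in each case
SELECT 8192 / (16 + 4) AS entries_int4,          -- ~409
       8192 / (24 + 4) AS entries_int4_plus_tid; -- ~292

That's roughly 409 vs. 292 entries per internal page, so something like a
30% loss of fan-in in the worst case.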

So, really, what needs to happen to make it work is someone needs to
write a suffix truncation patch, so that only those cases that
actually benefit from increasing the width of internal page
IndexTuples (those with many "would-be duplicates") pay the cost. This
is a classic technique that I've actually already prototyped, though
that prototype is extremely far from being worth posting here; it was
just to verify my understanding.

I think that I should write and put up for discussion a design
document for various nbtree enhancements. These include internal page
suffix truncation, prefix compression, and key normalization. I'm
probably not going to take any of these projects on, but it would be
useful if there was at least a little bit of clarity about how they
could be implemented. Maybe we could reach some kind of weak consensus
without going to great lengths. These optimizations are closely
intertwined things, and the lack of clarity on how they all fit
together is probably holding back an implementation of any one of
them.

[1] 
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com
-- 
Peter Geoghegan




estimate correlation of index separately from table (Re: [PERFORM] index fragmentation on insert-only table with non-unique column)

2017-07-07 Thread Justin Pryzby
Months ago I reported an issue with very slow index scans of tables with high
"correlation" of their indexed column, due to (we concluded at the time)
duplicate/repeated values of that column causing many lseek()s.  References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

My interim workarounds were a reindex job and daily-granularity partitions
(despite having an excessive number of child tables) to reduce query execution
time by encouraging seq scans for daily analysis jobs rather than index scans.
I've now cobbled together a patch to throw around and see if we can improve on
that.  I tried several changes, hoping to discourage index scans.  The logic
that seems to most accurately reflect costs has two changes:

Postgres' existing behavior stores table correlation (heap value vs. position)
but doesn't look at index correlation, so it can't distinguish between a
just-built index and a highly fragmented one, or one with highly nonsequential
TIDs.  My patch causes ANALYZE to do a TID scan (sampled across the MCVs) to
determine the correlation of heap TID vs. index tuple logical location (as
opposed to the table correlation, computed as heap TID vs. heap value).

The second change averages separate correlation values of small lists of 300
consecutive TIDs, rather than the coarse-granularity/aggregate correlation of
a small sample of pages across the entire index.  Postgres' existing sampling
is designed to give an even sample across all rows.  An issue with this
coarse-granularity correlation is that the index can have a broad correlation
(to physical heap location), but with many small-scale deviations, which don't
show up due to sampling a relatively small fraction of a large table; and/or
the effect of the deviations is insignificant/noise and correlation is still
computed near 1.

I believe the "large scale" correlation that postgres computes from block
sample fails to well represent small-scale uncorrelated reads which contribute
large number of random reads, but not included in planner cost.
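
For illustration only, here's a rough SQL approximation of that comparison
(not the patch itself), using pageinspect on one leaf page of the index from
the pageinspect session below; the block number and the 300-tuple chunk size
are arbitrary:

WITH t AS (
    SELECT itemoffset,
           (ctid::text::point)[0]::int AS heap_blk,  -- heap block number
           (itemoffset - 1) / 300      AS chunk      -- 300-TID chunks
    FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx', 6)
)
SELECT (SELECT corr(itemoffset, heap_blk) FROM t) AS coarse_corr,
       (SELECT avg(c) FROM (SELECT corr(itemoffset, heap_blk) AS c
                            FROM t GROUP BY chunk) s) AS avg_chunk_corr;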

Not only are the index reads highly random (which the planner already assumes),
but the CTIDs referenced within a given btree leaf page are also substantially
non-sequential.  It seems to affect INSERTs which, over a short interval of
time, have a small-to-moderate range of column values, but which still have a
strong overall trend in that column WRT time (and heap location).  Think of
inserting a "normal" distribution of timestamps centered around now().

My original report shows lseek() for nearly every read on the *heap*.  The
original problem was on a table of telecommunications CDRs, indexed by "record
opening time" (start time) and with high correlation value in pg_stats.  We
import data records from a file, which is probably more or less in order of
"end time".

That still displays broad correlation on "start time", but individual TIDs are
nowhere near sequential.  Two phone calls which end in the same 1-second
interval are not unlikely to have started many minutes apart... but two calls
which end within the same second are very likely to have started within an
hour of each other, since the typical duration is <1h.  But insert volume is
high, and there are substantial repeated keys, so the portion of an index scan
returning CTIDs for a given key value (a timestamp with second resolution in
our case) ends up reading heap tuples for a non-negligible fraction of the
table: maybe only 1%, but that's 300 seeks across 1% of a table which is 10s
of GB... which is still 300 seeks, and takes long enough that cache is
inadequate to substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of
the index blocks in a manner which is agnostic to different AMs.  Scanning the
entirety of all indices on relations during (auto) analyze may be too
expensive.  So I implemented an index scan of the MCV list.  I'm guessing this
might cause the correlation to be under-estimated, and bitmap scans to be
preferred somewhat more than justified, because of the btree insertion logic
for repeated keys (which spreads insertion points around to avoid O(n^2)
behavior).  The MCV list isn't perfect, since the same thing can happen with
e.g. normally distributed floating point values (or timestamps with
fractional seconds).
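
For reference, the MCV list being scanned is the one already visible in
pg_stats:

SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'cdrs_huawei_pgwrecord'
  AND attname = 'recordopeningtime';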

I ran pageinspect on a recent child of the table that triggered the original
report:

ts=# SELECT itemoffset, ctid FROM
bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6)
LIMIT 22 OFFSET 1;
 itemoffset |  ctid
------------+--------
          2 | (81,4)
          3 | (5,6)
          4 | (6,5)
          5 | (9,1)
          6 | (10,1)
          7 | (14,1)
          8 | (21,8)
          9 | (25,1)
         10 | (30,5)
         11 |

Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-08-15 Thread Claudio Freire
On Sat, Aug 13, 2016 at 3:54 PM, Justin Pryzby  wrote:
> On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
>> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane  wrote:
>> > Claudio Freire  writes:
>> >> So correlated index scans look extra favourable vs bitmap index scans
>> >> because bitmap heap scans consider random page costs sans correlation
>> >> effects (even though correlation applies to bitmap heap scans as
>> >> well).
>> >
>> > Really?  How?  The index ordering has nothing to do with the order in
>> > which heap tuples will be visited.
>>
>> It is not the order itself, but the density.
>>
>> If the index is read in a range scan (as opposed to =ANY scan), and
>> the index lead column is correlated with the table ordering, then the
>> parts of the table that need to be visited will be much denser than if
>> there were no correlation.  But Claudio is saying that this is not
>> being accounted for.
>
> I didn't completely understand Claudio/Jeff here, and I'm not sure we're on
> the same page.  For queries on these tables, the index scan was very slow,
> due to a fragmented index on a non-unique column, and a seq scan would have
> been (was) faster (even if it means reading 70GB and filtering out 6 of 7
> days' data).  That was resolved by adding a nightly reindex job (which
> sometimes competes with other maintenance and has trouble covering every
> table every night).

Yes, but a bitmap index scan should be faster than both; the planner
is discarding it because it estimates it will be slower, since it
doesn't account for correlation between index keys and physical
location.

And, while what you clarify there would indeed affect the estimation
for index scans, it would only make the issue worse: the planner
thinks the index scan will be better than it really is, because it's
expecting correlation, but the "fragmentation" of same-key runs
destroys that correlation. A bitmap index scan would restore it,
though, so the bitmap index scan would be that much better.




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-08-13 Thread Justin Pryzby
Regarding this earlier thread:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.ga11...@telsasoft.com

On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby  wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with the same
> value of the indexed column not in logical order, imposing a high performance
> penalty.

I have to point out that by "logical" I clearly meant "physical"; hopefully
nobody was too misled.

On Sun, Jun 05, 2016 at 12:28:47PM -0700, Jeff Janes wrote:
> On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane  wrote:
> > Claudio Freire  writes:
> >> So correlated index scans look extra favourable vs bitmap index scans
> >> because bitmap heap scans consider random page costs sans correlation
> >> effects (even though correlation applies to bitmap heap scans as
> >> well).
> >
> > Really?  How?  The index ordering has nothing to do with the order in
> > which heap tuples will be visited.
> 
> It is not the order itself, but the density.
> 
> If the index is read in a range scan (as opposed to =ANY scan), and
> the index lead column is correlated with the table ordering, then the
> parts of the table that need to be visited will be much denser than if
> there were no correlation.  But Claudio is saying that this is not
> being accounted for.

I didn't completely understand Claudio/Jeff here, and I'm not sure we're on
the same page.  For queries on these tables, the index scan was very slow,
due to a fragmented index on a non-unique column, and a seq scan would have
been (was) faster (even if it means reading 70GB and filtering out 6 of 7
days' data).  That was resolved by adding a nightly reindex job (which
sometimes competes with other maintenance and has trouble covering every
table every night).

But I did find that someone else had previously reported this problem (in a
strikingly similar context and message, perhaps clearer than mine):
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520d6610.8040...@emulex.com

I also found this older thread:
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou...@4ax.com

There was mention of a TODO item:

* Compute index correlation on CREATE INDEX and ANALYZE, use it for index
  scan cost estimation

.. but perhaps I misunderstand, and that's long since been resolved?

Justin




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-06-05 Thread Jeff Janes
On Sun, Jun 5, 2016 at 9:03 AM, Tom Lane  wrote:
> Claudio Freire  writes:
>> So correlated index scans look extra favourable vs bitmap index scans
>> because bitmap heap scans consider random page costs sans correlation
>> effects (even though correlation applies to bitmap heap scans as
>> well).
>
> Really?  How?  The index ordering has nothing to do with the order in
> which heap tuples will be visited.


It is not the order itself, but the density.

If the index is read in a range scan (as opposed to =ANY scan), and
the index lead column is correlated with the table ordering, then the
parts of the table that need to be visited will be much denser than if
there were no correlation.  But Claudio is saying that this is not
being accounted for.
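
One way to see that density effect directly is to count the distinct heap
pages holding the rows matched by a range predicate (a sketch, using the
table and column from this thread):

SELECT count(DISTINCT (ctid::text::point)[0]::int) AS heap_pages_touched,
       count(*)                                    AS rows_matched
FROM cdrs_huawei_pgwrecord
WHERE recordopeningtime >= '2016-05-21'
  AND recordopeningtime <  '2016-05-22';

With good correlation the matched rows cluster on relatively few pages; with
none, they're spread across most of the table.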


Cheers,

Jeff




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-06-05 Thread Tom Lane
Claudio Freire  writes:
> So correlated index scans look extra favourable vs bitmap index scans
> because bitmap heap scans consider random page costs sans correlation
> effects (even though correlation applies to bitmap heap scans as
> well).

Really?  How?  The index ordering has nothing to do with the order in
which heap tuples will be visited.

regards, tom lane




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-06-04 Thread Kevin Grittner
On Fri, Jun 3, 2016 at 6:54 PM, Justin Pryzby  wrote:

>  max_wal_size | 4GB | configuration file
>  min_wal_size | 6GB | configuration file

Just a minor digression -- it generally doesn't make sense to
configure the minimum for something greater than the maximum for
that same thing.  That should have no bearing on the performance
issue raised in this thread, but you might want to fix it anyway.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-06-03 Thread Claudio Freire
On Fri, Jun 3, 2016 at 8:54 PM, Justin Pryzby  wrote:
> As a test, I did SET effective_cache_size='1MB' before running explain, and
> it still does this:
>
> |->  Index Scan using 
> cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on 
> cdrs_huawei_pgwrecord_2016_05_29  (cost=0.44..1526689.49 rows=8342796 
> width=355)
> |  Index Cond: ((recordopeningtime >= '2016-05-29 
> 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
> 00:00:00'::timestamp without time zone))
>
> I set enable_indexscan=0, and got:
>
> |->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29  
> (cost=168006.10..4087526.04 rows=8342796 width=355)
> |  Recheck Cond: ((recordopeningtime >= '2016-05-29 
> 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
> 00:00:00'::timestamp without time zone))
> |  ->  Bitmap Index Scan on 
> cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx  (cost=0.00..165920.40 
> rows=8342796 width=0)
> |Index Cond: ((recordopeningtime >= '2016-05-29 
> 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
> 00:00:00'::timestamp without time zone))
>
> Here's a minimal query which seems to isolate the symptom:
>
> ts=# explain (analyze,buffers) SELECT sum(duration) FROM 
> cdrs_huawei_pgwrecord_2016_05_22 WHERE recordopeningtime>='2016-05-22' AND 
> recordopeningtime<'2016-05-23';
> | Aggregate  (cost=2888731.67..2888731.68 rows=1 width=8) (actual 
> time=388661.892..388661.892 rows=1 loops=1)
> |   Buffers: shared hit=4058501 read=1295147 written=35800
> |   ->  Index Scan using 
> cdrs_huawei_pgwrecord_2016_05_22_recordopeningtime_idx on 
> cdrs_huawei_pgwrecord_2016_05_22  (cost=0.56..2867075.33 rows=8662534 w
> |idth=8) (actual time=0.036..379332.910 rows=8575673 loops=1)
> | Index Cond: ((recordopeningtime >= '2016-05-22 00:00:00'::timestamp 
> without time zone) AND (recordopeningtime < '2016-05-23 00:00:00'::timestamp
> | without time zone))
> | Buffers: shared hit=4058501 read=1295147 written=35800
> | Planning time: 0.338 ms
> | Execution time: 388661.947 ms
>
> And here's an older one to avoid cache, with enable_indexscan=0
> |ts=# explain (analyze,buffers)  SELECT sum(duration) FROM 
> cdrs_huawei_pgwrecord_2016_05_08 WHERE recordopeningtime>='2016-05-08' AND 
> recordopeningtime<'2016-05-09';
> | Aggregate  (cost=10006286.58..10006286.59 rows=1 width=8) (actual 
> time=44219.156..44219.156 rows=1 loops=1)
> |   Buffers: shared hit=118 read=1213887 written=50113
> |   ->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_08  
> (cost=85142.24..9985848.96 rows=8175048 width=8) (actual 
> time=708.024..40106.062 rows=8179338 loops=1)
> | Recheck Cond: ((recordopeningtime >= '2016-05-08 
> 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-09 
> 00:00:00'::timestamp without time zone))
> | Rows Removed by Index Recheck: 74909
> | Heap Blocks: lossy=1213568
> | Buffers: shared hit=118 read=1213887 written=50113
> | ->  Bitmap Index Scan on 
> cdrs_huawei_pgwrecord_2016_05_08_recordopeningtime_idx1  (cost=0.00..83098.48 
> rows=8175048 width=0) (actual time=706.557..706.557 rows=12135680 loops=1)
> |   Index Cond: ((recordopeningtime >= '2016-05-08 
> 00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-09 
> 00:00:00'::timestamp without time zone))
> |   Buffers: shared hit=117 read=320
> | Planning time: 214.786 ms
> | Execution time: 44228.874 ms
> |(12 rows)


Correct me if I'm wrong, but this looks like the planner not
accounting for correlation when using bitmap heap scans.

Checking the source, it really doesn't.

So correlated index scans look extra favourable vs bitmap index scans
because bitmap heap scans consider random page costs sans correlation
effects (even though correlation applies to bitmap heap scans as
well). While that sounds desirable a priori, it seems it's hurting
this case quite badly.

I'm not sure there's any simple way of working around that.




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-06-03 Thread Justin Pryzby
On Fri, Jun 03, 2016 at 06:26:33PM -0300, Claudio Freire wrote:
> On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby  wrote:
> >> > First, I found I was able to get 30-50min query results on full week's 
> >> > table by
> >> > preferring a seq scan to an index scan.  The row estimates seemed fine, 
> >> > and the
> >> > only condition is the timestamp, so the planner's use of index scan is as
> >> > expected.
> >>
> >> Can you show us the query?  I would expect a bitmap scan of the index
> >> (which would do what you want, but even more so), instead.
> > See explain, also showing additional tables/views being joined.  It's NOT 
> > doing
> > a bitmap scan though, and I'd be interested to find out why; I'm sure that 
> > would've
> > improved this query enough so it never would've been an issue.
> > https://explain.depesz.com/s/s8KP
> >
> >  ->  Index Scan using 
> > cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on 
> > cdrs_huawei_pgwrecord_2016_05_01  (cost=0.56..1601734.57 rows=8943848 
> > width=349)
> >Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp 
> > without time zone) AND (recordopeningtime < '2016-05-08 
> > 00:00:00'::timestamp without time zone))
> 
> Please show your guc settings ( see
> https://wiki.postgresql.org/wiki/Server_Configuration )
> 
> With proper configuration, a plan node like that, if it would result in
> I/O, should have been a bitmap index/heap scan. If it wasn't, the planner
> probably thinks it has more cache than it really does, which would mean
> the wrong value was set in effective_cache_size.

ts=# SELECT name, current_setting(name), SOURCE FROM pg_settings WHERE 
SOURCE='configuration file';
 dynamic_shared_memory_type  | posix | configuration file
 effective_cache_size        | 64GB  | configuration file
 effective_io_concurrency    | 8     | configuration file
 huge_pages                  | try   | configuration file
 log_autovacuum_min_duration | 0     | configuration file
 log_checkpoints             | on    | configuration file
 maintenance_work_mem        | 6GB   | configuration file
 max_connections             | 200   | configuration file
 max_wal_size                | 4GB   | configuration file
 min_wal_size                | 6GB   | configuration file
 shared_buffers              | 8GB   | configuration file
 wal_compression             | on    | configuration file
 work_mem                    | 1GB   | configuration file

I changed at least maintenance_work_mem since I originally wrote, to try to
avoid tempfiles during REINDEX (though I'm not sure it matters, as the
tempfiles are effectively cached and may never actually be written).

It's entirely possible those settings aren't ideal.  The server has 72GB RAM.
There are usually very few (typically n<3 but at most a handful) nontrivial
queries running at once, if at all.

I wouldn't expect any data that's not recent (table data from the last 2 days,
or indexes from this month) to be cached, and I wouldn't expect even that to
be entirely cached:

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_2016_05_..$';
gb | 425.783050537109

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_2016_05_...*idx';
gb | 60.0909423828125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_201605.*idx';
gb | 4.85528564453125

ts=# SELECT sum(pg_table_size(oid))/1024^3 gb FROM pg_class WHERE 
relname~'_201605$';
gb | 86.8688049316406

As a test, I did SET effective_cache_size='1MB' before running explain, and
it still does this:

|->  Index Scan using 
cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx on 
cdrs_huawei_pgwrecord_2016_05_29  (cost=0.44..1526689.49 rows=8342796 width=355)
|  Index Cond: ((recordopeningtime >= '2016-05-29 
00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
00:00:00'::timestamp without time zone))

I set enable_indexscan=0, and got:

|->  Bitmap Heap Scan on cdrs_huawei_pgwrecord_2016_05_29  
(cost=168006.10..4087526.04 rows=8342796 width=355)
|  Recheck Cond: ((recordopeningtime >= '2016-05-29 
00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
00:00:00'::timestamp without time zone))
|  ->  Bitmap Index Scan on 
cdrs_huawei_pgwrecord_2016_05_29_recordopeningtime_idx  (cost=0.00..165920.40 
rows=8342796 width=0)
|Index Cond: ((recordopeningtime >= '2016-05-29 
00:00:00'::timestamp without time zone) AND (recordopeningtime < '2016-05-30 
00:00:00'::timestamp without time zone))

Here's a minimal query which seems to isolate the symptom:

ts=# explain (analyze,buffers) SELECT sum(duration) FROM 
cdrs_huawei_pgwrecord_2016_05_22 WHERE recordopeningtime>='2016-05-22' AND 

Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-06-03 Thread Claudio Freire
On Wed, May 25, 2016 at 11:00 AM, Justin Pryzby  wrote:
>> > First, I found I was able to get 30-50min query results on full week's 
>> > table by
>> > preferring a seq scan to an index scan.  The row estimates seemed fine, and 
>> > the
>> > only condition is the timestamp, so the planner's use of index scan is as
>> > expected.
>>
>> Can you show us the query?  I would expect a bitmap scan of the index
>> (which would do what you want, but even more so), instead.
> See explain, also showing additional tables/views being joined.  It's NOT 
> doing
> a bitmap scan though, and I'd be interested to find out why; I'm sure that 
> would've
> improved this query enough so it never would've been an issue.
> https://explain.depesz.com/s/s8KP
>
>  ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx 
> on cdrs_huawei_pgwrecord_2016_05_01  (cost=0.56..1601734.57 rows=8943848 
> width=349)
>Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp 
> without time zone) AND (recordopeningtime < '2016-05-08 00:00:00'::timestamp 
> without time zone))

Please show your guc settings ( see
https://wiki.postgresql.org/wiki/Server_Configuration )

With proper configuration, a plan node like that, if it would result in
I/O, should have been a bitmap index/heap scan. If it wasn't, the planner
probably thinks it has more cache than it really does, which would mean
the wrong value was set in effective_cache_size.




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-25 Thread Justin Pryzby
On Tue, May 24, 2016 at 11:23:48PM -0700, Jeff Janes wrote:
> > But note the non-uniqueness of the index column:
> > ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE 
> > recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 
> > 1 ORDER BY 2 DESC;
> >   recordopeningtime  | count
> > ---------------------+-------
> >  2016-05-21 12:17:29 |   176
> >  2016-05-21 12:17:25 |   171
> >  2016-05-21 13:11:33 |   170
> >  2016-05-21 10:20:02 |   169
> >  2016-05-21 11:30:02 |   167
> > [...]
> 
> That is not that much duplication.  You aren't going to have dozens or
> hundreds of leaf pages all with equal values.   (and you only showed
> the most highly duplicated ones, presumably the average is much less)

Point taken, but it's not that great of a range either:

ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE 
recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 
ORDER BY 2 LIMIT 19;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 03:10:05 |    44
 2016-05-21 03:55:05 |    44
 2016-05-21 04:55:05 |    45

ts=# SELECT count(distinct recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE 
recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]
count | 86400

ts=# SELECT count(recordopeningtime) FROM cdrs_huawei_pgwrecord WHERE 
recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22';
-[ RECORD 1 ]--
count | 8892865

> > We have a daily analytic query which processes the previous day's data.
> > For new child tables, with only 1 day's data loaded, this runs in ~30min,
> > and for child tables with an entire week's worth of data loaded, it takes
> > several hours (even though both queries process the same amount of data).
> 
> For an append only table, why would the first day of a new partition
> be any less fragmented than that same day would be a week from now?
> Are you sure it isn't just that your week-old data has all been aged
> out of the cache?
I don't think it's a cache effect, since we're not using the source table for
(maybe) anything else the entire rest of the day.  The server has 72GB RAM,
the same size as the largest of the tables being joined (the job begins at
4am).

I didn't mean that a given day is more fragmented now than it was last week
(but I don't know, either).  I guess when we do a query on the table with ~32
hours of data in it, it might do a seq scan rather than an index scan, too.

Compare the end of month partition tables:
ts=# select * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_02_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# select * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_03_29_recordopeningtime_idx');
leaf_fragmentation | 48.38
ts=# select * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_04_29_recordopeningtime_idx');
leaf_fragmentation | 48.6
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_04_22_recordopeningtime_idx');
leaf_fragmentation | 48.66
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_03_22_recordopeningtime_idx');
leaf_fragmentation | 48.27
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_02_22_recordopeningtime_idx');
leaf_fragmentation | 48

This one I reindexed as a test:
ts=# SELECT * FROM 
pgstatindex('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx');
leaf_fragmentation | 0.01

.. and the query ran in ~30min (reran a 2nd time, with cache effects: 25min).
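
A quicker way to survey all of those at once might be something like this
(a sketch, assuming the pgstattuple extension and this naming pattern):

SELECT relname, (pgstatindex(relname)).leaf_fragmentation
FROM pg_class
WHERE relname ~ 'recordopeningtime_idx$'
ORDER BY 2 DESC;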

> > First, I found I was able to get 30-50min query results on full week's 
> > table by
> > preferring a seq scan to an index scan.  The row estimates seemed fine, and 
> > the
> > only condition is the timestamp, so the planner's use of index scan is as
> > expected.
> 
> Can you show us the query?  I would expect a bitmap scan of the index
> (which would do what you want, but even more so), instead.
See explain, also showing additional tables/views being joined.  It's NOT doing
a bitmap scan though, and I'd be interested to find out why; I'm sure that would've
improved this query enough so it never would've been an issue.
https://explain.depesz.com/s/s8KP

 ->  Index Scan using cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx on 
cdrs_huawei_pgwrecord_2016_05_01  (cost=0.56..1601734.57 rows=8943848 width=349)
   Index Cond: ((recordopeningtime >= '2016-05-07 00:00:00'::timestamp 
without time zone) AND (recordopeningtime < '2016-05-08 00:00:00'::timestamp 
without time zone))

> > AFAICT what's happening is that the index scan was returning pages
> > nonsequentially.  strace-ing the backend showed alternating lseek()s and
> > read()s, with the offsets not consistently increasing (nor consistently
> > decreasing):
..
> 
> Which of those are the table, and which the index?
Those weren't necessarily straces of the same process; I believe both of these
were table data/heap, and didn't include any index access.

> Something doesn't add up here.  How could an index of an append-only
> table possibly become that 

Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-25 Thread Justin Pryzby
On Tue, May 24, 2016 at 09:16:20PM -0700, Peter Geoghegan wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby  wrote:
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats is how it will get data from the index scan, which
> > assumption seems to be very poor on what turns out to be a highly fragmented
> > index.  Is there a way to help it to understand otherwise??
> 
> Your complaint is vague. Are you complaining about the planner making
> a poor choice? I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worse than an alternative that it had available.

I was thinking there are a few possible places to make improvements: the
planner could understand that scans of non-unique indices don't result in
strictly sequential scans of the table, the degree of non-sequentialness
being determined by the column statistics, and perhaps by properties of the
index itself.

Or the INSERT code or btree scan could improve on this, even if tuples aren't
fully ordered.

> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

I had the awful idea to change the index to use (recordopeningtime,ctid).
Maybe somebody will convince me otherwise, but that may actually work better
than trying to reindex this table daily by 4am.

Thanks,
Justin




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-25 Thread Jeff Janes
On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby  wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with the same
> value of the indexed column not in logical order, imposing a high performance
> penalty.
>
> Running PG 9.5.3 now, we have time-based partitions of append-only tables
> with data loaded from other sources.  The tables are partitioned by time, and
> the timestamp column has a non-unique, not-null btree index.
>
> The child tables are each ~75GB and expected to keep growing.  For a child
> table with a week's worth of data:
> relpages  | 11255802
> reltuples | 5.90502e+07
>
> The data is loaded shortly after it's available, so it has high correlation in
> pg_statistic:
> [pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct 
> FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename 
> LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" 
> |head
>             tablename             | correlation | n_distinct
> ----------------------------------+-------------+------------
>  cdrs_huawei_pgwrecord            |        0.97 |     102892
>  cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
>  cdrs_huawei_pgwrecord_2016_02_22 |        0.43 |      91916
>  cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
>  cdrs_huawei_pgwrecord_2016_03_01 |        0.47 |      97485
>
> But note the non-uniqueness of the index column:
> ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE 
> recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 
> ORDER BY 2 DESC;
>   recordopeningtime  | count
> ---------------------+-------
>  2016-05-21 12:17:29 |   176
>  2016-05-21 12:17:25 |   171
>  2016-05-21 13:11:33 |   170
>  2016-05-21 10:20:02 |   169
>  2016-05-21 11:30:02 |   167
> [...]

That is not that much duplication.  You aren't going to have dozens or
hundreds of leaf pages all with equal values.   (and you only showed
the most highly duplicated ones, presumably the average is much less)


> We have a daily analytic query which processes the previous day's data.  For
> new child tables, with only 1 day's data loaded, this runs in ~30min, and for
> child tables with an entire week's worth of data loaded, it takes several
> hours (even though both queries process the same amount of data).

For an append only table, why would the first day of a new partition
be any less fragmented than that same day would be a week from now?
Are you sure it isn't just that your week-old data has all been aged
out of the cache?


> First, I found I was able to get 30-50min query results on full week's table 
> by
> preferring a seq scan to an index scan.  The row estimates seemed fine, and the
> only condition is the timestamp, so the planner's use of index scan is as
> expected.

Can you show us the query?  I would expect a bitmap scan of the index
(which would do what you want, but even more so), instead.

>
> AFAICT what's happening is that the index scan was returning pages
> nonsequentially.  strace-ing the backend showed alternating lseek()s and
> read()s, with the offsets not consistently increasing (nor consistently
> decreasing):
> % sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
> lseek(773, 1059766272, SEEK_SET)= 1059766272
> lseek(773, 824926208, SEEK_SET) = 824926208
> lseek(773, 990027776, SEEK_SET) = 990027776
> lseek(773, 990330880, SEEK_SET) = 990330880
> lseek(773, 1038942208, SEEK_SET)= 1038942208
> lseek(773, 1059856384, SEEK_SET)= 1059856384
> lseek(773, 977305600, SEEK_SET) = 977305600
> lseek(773, 990347264, SEEK_SET) = 990347264
> lseek(773, 871096320, SEEK_SET) = 871096320
>
> .. and consecutive read()s being rare:
> read(802, "g"..., 8192) = 8192
> lseek(802, 918003712, SEEK_SET) = 918003712
> read(802, "c"..., 8192) = 8192
> lseek(802, 859136000, SEEK_SET) = 859136000
> read(802, "a"..., 8192) = 8192
> lseek(802, 919601152, SEEK_SET) = 919601152
> read(802, "d"..., 8192) = 8192
> lseek(802, 905101312, SEEK_SET) = 905101312
> read(802, "c"..., 8192) = 8192
> lseek(801, 507863040, SEEK_SET) = 507863040
> read(801, "p"..., 8192) = 8192
> lseek(802, 914235392, SEEK_SET) = 914235392
> read(802, "c"..., 8192) = 8192


Which of those are the table, and which the index?

Something doesn't add up here.  How could an index of an append-only
table possibly become that fragmented, when the highest amount of key
duplication is about 170?

Cheers,

Jeff




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-24 Thread Peter Geoghegan
On Tue, May 24, 2016 at 9:43 PM, Tom Lane  wrote:
> Yeah.  I wonder what would happen if we used the same rule for index
> insertions.  It would likely make insertions more expensive, but maybe
> not by much.  The existing "randomization" rule for where to insert new
> items in a run of identical index entries would go away, because the
> insertion point would become deterministic.  I am not sure if that's
> good or bad for insertion performance, but it would likely help for
> scan performance.

I think that if somebody tacked on a tie-breaker in the same way that
tuplesort.c does for B-Tree IndexTuples, there'd be significant negative
consequences.

The equal-to-high-key case gets a choice of which page to put the new
IndexTuple on, and I imagine that that's quite useful when it comes
up. I'd also have concerns about the key space in the index. I think
that it would seriously mess with the long term utility of values in
internal pages, which currently can reasonably have little to do with
the values currently stored in leaf pages. They're functionally only
separators of the key space that guide index scans, so it doesn't
matter if the actual values are completely absent from the leaf
pages/the table itself (perhaps some of the values in the internal
pages were inserted years ago, and have long since been deleted and
vacuumed away). Provided the distribution of values at the leaf level
is still well characterized at higher levels (e.g. many string values
that start with vowels, very few that start with the letters 'x' or
'z'), there should be no significant bloat. That's very valuable.
Unique indexes are another problem for this naive approach.

Maybe somebody could do better with a more sophisticated approach, but
it's probably better to focus on duplicate storage or even leaf page
compression, as Stephen mentioned.

-- 
Peter Geoghegan




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-24 Thread Tom Lane
Peter Geoghegan  writes:
> The basic problem is that the B-Tree code doesn't maintain this
> property. However, B-Tree index builds will create an index that
> initially has this property, because the tuplesort.c code happens to
> sort index tuples with a CTID tie-breaker.

Yeah.  I wonder what would happen if we used the same rule for index
insertions.  It would likely make insertions more expensive, but maybe
not by much.  The existing "randomization" rule for where to insert new
items in a run of identical index entries would go away, because the
insertion point would become deterministic.  I am not sure if that's
good or bad for insertion performance, but it would likely help for
scan performance.

regards, tom lane




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-24 Thread Stephen Frost
* Peter Geoghegan (p...@bowt.ie) wrote:
> On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby  wrote:
> > I was able to see great improvement without planner parameters by REINDEX 
> > the
> > timestamp index.  My theory is that the index/planner doesn't handle well 
> > the
> > case of many tuples with same column value, and returns pages out of logical
> > order.  Reindex fixes that, rewriting the index data with pages in order
> > (confirmed with pageinspect), which causes index scans to fetch heap data 
> > more
> > or less monotonically (if not consecutively).  strace shows that consecutive
> > read()s are common (without intervening seeks).  I gather this allows the OS
> > readahead to kick in.
> 
> The basic problem is that the B-Tree code doesn't maintain this
> property. However, B-Tree index builds will create an index that
> initially has this property, because the tuplesort.c code happens to
> sort index tuples with a CTID tie-breaker.
> 
> > Postgres seems to assume that the high degree of correlation of the table
> > column seen in pg_stats is how it will get data from the index scan, which
> > assumption seems to be very poor on what turns out to be a highly fragmented
> > index.  Is there a way to help it to understand otherwise??
> 
> Your complaint is vague. Are you complaining about the planner making
> a poor choice? I don't think that's the issue here, because you never
> made any firm statement about the planner making a choice that was
> worse than an alternative that it had available.
> 
> If you're arguing for the idea that B-Trees should reliably keep
> tuples in order by a tie-break condition, that seems difficult to
> implement, and likely not worth it in practice.

The ongoing discussion around how to effectively handle duplicate values
in B-Tree seems relevant to this.  In particular, if we're able to store
duplicate values efficiently and all the locations under a single key
are easily available then we could certainly sort those prior to going
and visiting them.

That's not quite the same as keeping the tuples in order in the heap,
but would more-or-less achieve the effect desired, I believe?

Thanks!

Stephen




Re: [PERFORM] index fragmentation on insert-only table with non-unique column

2016-05-24 Thread Peter Geoghegan
On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby  wrote:
> I was able to see great improvement without planner parameters by REINDEX the
> timestamp index.  My theory is that the index/planner doesn't handle well the
> case of many tuples with same column value, and returns pages out of logical
> order.  Reindex fixes that, rewriting the index data with pages in order
> (confirmed with pageinspect), which causes index scans to fetch heap data more
> or less monotonically (if not consecutively).  strace shows that consecutive
> read()s are common (without intervening seeks).  I gather this allows the OS
> readahead to kick in.

The basic problem is that the B-Tree code doesn't maintain this
property. However, B-Tree index builds will create an index that
initially has this property, because the tuplesort.c code happens to
sort index tuples with a CTID tie-breaker.
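
That initial property is easy to check with pageinspect on a freshly built
index (the index name and block number here are only examples):

SELECT itemoffset, ctid
FROM bt_page_items('cdrs_huawei_pgwrecord_2016_05_01_recordopeningtime_idx', 1)
LIMIT 10;

Right after a REINDEX the ctids come back in heap order; as subsequent
insertions accumulate, that ordering degrades.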

> Postgres seems to assume that the high degree of correlation of the table
> column seen in pg_stats is how it will get data from the index scan, which
> assumption seems to be very poor on what turns out to be a highly fragmented
> index.  Is there a way to help it to understand otherwise??

Your complaint is vague. Are you complaining about the planner making
a poor choice? I don't think that's the issue here, because you never
made any firm statement about the planner making a choice that was
worse than an alternative that it had available.

If you're arguing for the idea that B-Trees should reliably keep
tuples in order by a tie-break condition, that seems difficult to
implement, and likely not worth it in practice.

-- 
Peter Geoghegan

