Re: Filtering on an enum field in a foreign table

2019-07-16 Thread Vladimir Ryabtsev
Wait folks,

I realized that if I create a basic view with the enum fields converted to
text, it does the trick! I query the view with a text predicate, and the view
implementation (I think) converts it into enums when querying the
underlying table.
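
A minimal sketch of this workaround (assuming the view is created on the
remote server with the enum columns cast to text and a foreign table is
mapped to it; all names are illustrative):

-- On the remote server: expose the enum columns as text.
CREATE VIEW example_text AS
SELECT id, product::text AS product, status::text AS status
FROM example;

-- On the local server: map a foreign table to that view.
CREATE FOREIGN TABLE example_text (id integer, product text, status text)
SERVER remote;

-- text = text is a built-in operator, so the predicate can be shipped.
SELECT * FROM example_text WHERE product = 'a' AND status = 'active';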

But I still think it should work without such workarounds...

Tue, Jul 16, 2019 at 18:06, Vladimir Ryabtsev :

> Sorry, the version() is
>
> "PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg14.04+1) on x86_64-pc-linux-gnu,
> compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit"
>
> I gave use_remote_estimate a try but unfortunately it is the same.
>
> Additionally I see on your page (in "Remote Execution Options"):
>
> "By default, only WHERE clauses using built-in operators and functions
> will be considered for execution on the remote server. Clauses involving
> non-built-in functions are checked locally after rows are fetched."
>
> I think enum types somehow fall into the same category and they are
> filtered only locally, which is seen in the plan (Filter clause).
> If I use only columns of built-in type in the predicate everything works
> as expected (with filtering on the remote server).
>
> I need a workaround to make this query execute remotely. One option may be
> using a materialized view with these enum values converted to text but then
> I will need to refresh this view periodically on the remote server.
> And actually it looks like a performance bug in the DBMS...
>
>
> Tue, Jul 16, 2019 at 17:45, Nikolay Samokhvalov :
>
>> Hi,
>>
>> On Tue, Jul 16, 2019 at 4:00 PM Vladimir Ryabtsev 
>> wrote:
>>
>>> I am querying a remote server through a foreign table definition.
>>>
>>> CREATE TABLE example (id integer, product product_enum, status
>>> status_enum)
>>>
>> ...
>>
>>> When I am querying the foreign table on enum predicate like
>>>
>>> select * from example where product = 'a' and status = 'active'
>>>
>>> I see that filtering happens on my server which can be seen in the plan
>>> and can be felt from the query performance (indices are not used of course).
>>>
>>
>> What Postgres version do you use?
>>
>> Any changes in plans if you collect stats on the FDW table ("analyze
>> example;")?
>>
>> Have you considered changing the option "use_remote_estimate" (see
>> https://www.postgresql.org/docs/current/postgres-fdw.html#id-1.11.7.42.10)?
>>
>>
>>
>


Re: Filtering on an enum field in a foreign table

2019-07-16 Thread Vladimir Ryabtsev
Sorry, the version() is

"PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg14.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit"

I gave use_remote_estimate a try but unfortunately it is the same.

Additionally, I see on the page you linked (under "Remote Execution Options"):

"By default, only WHERE clauses using built-in operators and functions will
be considered for execution on the remote server. Clauses involving
non-built-in functions are checked locally after rows are fetched."

I think enum types somehow fall into the same category and are filtered only
locally, which can be seen in the plan (the Filter clause).
If I use only columns of built-in types in the predicate, everything works
as expected (with filtering on the remote server).

I need a workaround to make this query execute remotely. One option may be
using a materialized view with the enum values converted to text, but then
I would need to refresh this view periodically on the remote server.
And actually, it looks like a performance bug in the DBMS...


Tue, Jul 16, 2019 at 17:45, Nikolay Samokhvalov :

> Hi,
>
> On Tue, Jul 16, 2019 at 4:00 PM Vladimir Ryabtsev 
> wrote:
>
>> I am querying a remote server through a foreign table definition.
>>
>> CREATE TABLE example (id integer, product product_enum, status
>> status_enum)
>>
> ...
>
>> When I am querying the foreign table on enum predicate like
>>
>> select * from example where product = 'a' and status = 'active'
>>
>> I see that filtering happens on my server which can be seen in the plan
>> and can be felt from the query performance (indices are not used of course).
>>
>
> What Postgres version do you use?
>
> Any changes in plans if you collect stats on the FDW table ("analyze
> example;")?
>
> Have you considered changing the option "use_remote_estimate" (see
> https://www.postgresql.org/docs/current/postgres-fdw.html#id-1.11.7.42.10)?
>
>
>


Filtering on an enum field in a foreign table

2019-07-16 Thread Vladimir Ryabtsev
I am querying a remote server through a foreign table definition.

CREATE TABLE example (id integer, product product_enum, status status_enum)

Where

CREATE TYPE status AS ENUM ('active', 'testing', 'inactive', ...);
CREATE TYPE product AS ENUM ('a', 'b', 'c', ...);

I re-created enums on my server and created a foreign table as follows:

CREATE FOREIGN TABLE example (id integer, product product_enum, status
status_enum)
SERVER remote;

When I query the foreign table with an enum predicate like

select * from example where product = 'a' and status = 'active'

I see that the filtering happens on my local server, which can be seen in the
plan and felt from the query performance (indices are not used, of course).

I tried to cheat by defining the enum fields as text in the foreign table,
but then the remote query fails with:

ERROR: operator does not exist: public.product = text
HINT: No operator matches the given name and argument type(s). You might need
to add explicit type casts.

This is ridiculous. Is there a way to work around this and force it to
execute the remote query as is?

Regards,
Vlad


Re: Why Postgres doesn't use TID scan?

2018-12-19 Thread Vladimir Ryabtsev
> The fundamental issue is that "ANY" has two meanings in PG, one of them
following the SQL standard and one not:

Oh yes, I was aware of the two forms, but it did not occur to me; I thought I
was using the same form in both cases since my query returns only one row and
column.
Thanks for pointing that out.

--
Vlad


Re: Why Postgres doesn't use TID scan?

2018-12-19 Thread Vladimir Ryabtsev
> The workaround is to do it like this instead:

Strange, I tried to do it like this, but the first thing that came to my mind
was array_agg(), not array():

delete from log
where ctid = any(
select array_agg(ctid) from (
select ctid from log
where timestamp < now() at time zone 'pst' - interval '2 month'
limit 10
) v);

This query complained like this:

ERROR: operator does not exist: tid = tid[]
LINE 2: where ctid = any(
   ^
HINT: No operator matches the given name and argument type(s). You might
need to add explicit type casts.

Which is strange, because both array(select ...) and select array_agg(...)
return the same data type, tid[].
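
For reference, a minimal sketch of the distinction (as it was explained later
in the thread, what matters is not the data type but which form of ANY gets
used):

-- Subquery form: ctid is compared with each ROW of the subquery, and each
-- row here is a single tid[] value, hence "operator does not exist: tid = tid[]".
-- where ctid = any(select array_agg(ctid) from (...) v)

-- Array form: array(...) builds one tid[] value and ANY compares ctid with
-- its ELEMENTS, so the tid = tid operator applies.
delete from log
where ctid = any(array(
    select ctid from log
    where timestamp < now() at time zone 'pst' - interval '2 month'
    limit 10
));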

> But of course that's still an ugly hack.

Come on... Due to the declarative nature of SQL, developers sometimes need to
write much dirtier and uglier hacks.
This one is just a fluffy little hack.

--
Vlad


Re: Why Postgres doesn't use TID scan?

2018-12-17 Thread Vladimir Ryabtsev
I can't believe it.
I have seen recommendations on the Internet to do it like this (e.g.
https://stackoverflow.com/questions/5170546/how-do-i-delete-a-fixed-number-of-rows-with-sorting-in-postgresql
).
Did it really work in 2011? Are you saying they broke it? It's a shame...

Anyway, I think the problem is pretty clear: I want to eventually clear the
table based on the predicate, but I don't want to lock it for a long time.
The table does not have a primary key.
What would be a proper solution?

--
Vlad

Mon, Dec 17, 2018 at 17:40, Tom Lane :

> Vladimir Ryabtsev  writes:
> > I want to clean a large log table by chunks. I write such a query:
> > delete from categorization.log
> > where ctid in (
> > select ctid from categorization.log
> > where timestamp < now() - interval '2 month'
> > limit 1000
> > )
>
> > Why does this query want to use Seq Scan and Sort on a 423M rows table?
>
> There's no support for using ctid as a join key in this way; specifically,
> nodeTidscan.c doesn't have support for being a parameterized inner scan,
> nor does tidpath.c have code to generate such a plan.  The header comments
> for the latter say
>
>  * There is currently no special support for joins involving CTID; in
>  * particular nothing corresponding to best_inner_indexscan().  Since it's
>  * not very useful to store TIDs of one table in another table, there
>  * doesn't seem to be enough use-case to justify adding a lot of code
>  * for that.
>
> Queries like yours are kinda sorta counterexamples to that, but pretty
> much all the ones I've seen seem like crude hacks (and this one is not
> an exception).  Writing a bunch of code to support them feels like
> solving the wrong problem.  Admittedly, it's not clear to me what the
> right problem to solve instead would be.
>
> (It's possible that I'm overestimating the amount of new code that would
> be needed to implement this, however.  indxpath.c is pretty huge, but
> that's mostly because there are so many cases to consider.  There'd only
> be one interesting case for an inner TID scan.  Also, this comment is
> ancient, predating the current approach with parameterized paths ---
> in fact best_inner_indexscan doesn't exist as such anymore.  So maybe
> that old judgment that it'd take a lot of added code is wrong.)
>
> regards, tom lane
>


Why Postgres doesn't use TID scan?

2018-12-17 Thread Vladimir Ryabtsev
I want to clean a large log table in chunks. I wrote the following query:

delete from categorization.log
where ctid in (
select ctid from categorization.log
where timestamp < now() - interval '2 month'
limit 1000
)

But I am getting the following weird plan:

[Plan 1]
Delete on log  (cost=74988058.17..77101421.77 rows=211334860 width=36)
  ->  Merge Semi Join  (cost=74988058.17..77101421.77 rows=211334860 width=36)
        Merge Cond: (log.ctid = "ANY_subquery".ctid)
        ->  Sort  (cost=74987967.33..76044641.63 rows=422669720 width=6)
              Sort Key: log.ctid
              ->  Seq Scan on log  (cost=0.00..8651368.20 rows=422669720 width=6)
        ->  Sort  (cost=90.83..93.33 rows=1000 width=36)
              Sort Key: "ANY_subquery".ctid
              ->  Subquery Scan on "ANY_subquery"  (cost=0.00..41.00 rows=1000 width=36)
                    ->  Limit  (cost=0.00..31.00 rows=1000 width=6)
                          ->  Seq Scan on log log_1  (cost=0.00..11821391.10 rows=381284367 width=6)
                                Filter: ("timestamp" < (now() - '2 mons'::interval))

And it takes forever to complete (with any number in LIMIT from 1 to 1000).

However, if I extract the CTIDs manually:

select array_agg(ctid) from (
select ctid from s.log
where timestamp < now() - interval '2 month'
limit 5
) v

and substitute the result into the DELETE query, it does a basic TID scan
and completes in just milliseconds:

explain
delete from s.log
where ctid =
any('{"(3020560,1)","(3020560,2)","(3020560,3)","(3020560,4)","(3020560,5)"}'::tid[])

[Plan 2]
Delete on log  (cost=0.01..20.06 rows=5 width=6)
  ->  Tid Scan on log  (cost=0.01..20.06 rows=5 width=6)
        TID Cond: (ctid = ANY ('{"(3020560,1)","(3020560,2)","(3020560,3)","(3020560,4)","(3020560,5)"}'::tid[]))

In case the table's definition helps:

CREATE TABLE s.log
(
article_id bigint NOT NULL,
topic_id integer NOT NULL,
weight double precision NOT NULL,
cat_system character varying(50) NOT NULL,
lang character varying(5) NOT NULL,
is_final boolean NOT NULL,
comment character varying(50),
"timestamp" timestamp without time zone DEFAULT now()
)

Number of rows ~ 423M
n_live_tup = 422426725
last_vacuum = 2018-10-22
Postgres version(): PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg14.04+1) on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4,
64-bit

Why does this query want to use Seq Scan and Sort on a 423M rows table?
How to fix this (reduce it to Plan 2)?

--
Vlad


Re: Gained %20 performance after disabling bitmapscan

2018-10-19 Thread Vladimir Ryabtsev
Yavuz, I cannot add much to the other points, but as for the index-only scan:
an (auto)vacuum must be run for the optimizer to understand it can use an
index-only scan (the visibility map needs to be populated). Please check
whether autovacuum ran on the table after index creation and, if not, run it
manually.
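
One way to check (a sketch; substitute the actual table name for
'your_table'):

SELECT relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'your_table';

VACUUM ANALYZE your_table;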

Vlad


Re: Why could different data in a table be processed with different performance?

2018-10-10 Thread Vladimir Ryabtsev
FYI, posting an intermediate update on the issue.

I disabled index scans to keep the existing order, and copied part of the
"slow" range into another table (3M rows in a 2.2 GB table + 17 GB of TOAST).
I was able to reproduce the slow reads from this copy. Then I ran CLUSTER on
the copy using the PK, and everything improved significantly: overall time
became 6 times faster, with a disk read speed (reported by iotop) of
30-60 MB/s.
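
Roughly the step that made the difference (a sketch; the copy's table and
index names are illustrative):

CLUSTER articles_slow_copy USING articles_slow_copy_pkey;
ANALYZE articles_slow_copy;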

I think we can take bad physical data distribution as the main hypothesis for
the issue. I was not able to run seekwatcher, though (it does not work out of
the box on Ubuntu and I failed to rebuild it), to confirm the large number of
seeks.

I still don't have enough disk space to solve the problem with the original
table; I am waiting for it from the admin/devops team.

My plan is to partition the original table and CLUSTER every partition on
primary key once I have space.

Best regards,
Vlad


Re: Why the index is not used ?

2018-10-07 Thread Vladimir Ryabtsev
Additionally, it is not clear why you want to search the table on encrypted
data. Usually you match a user by non-personal data (such as login or
user ID) and then decrypt the personal data. If you need to store
user-identifying data encrypted as well (e.g. a bank account number), you can
use a deterministic algorithm for it (without salt), because it is guaranteed
to be unique and you don't need different ciphertexts for two identical
input strings.

Vlad


Re: Why the index is not used ?

2018-10-07 Thread Vladimir Ryabtsev
Didier,

you were given a few things to check in another message of mine on the same
day. You have not provided any feedback.
It is up to you how to implement your system, but with your approach you can,
without a doubt, consider your database as not encrypted. You (or probably
your management) have no clear understanding of which risks you are
protecting your data from.

Regards,
Vlad


Sun, Oct 7, 2018 at 11:33, ROS Didier :

> Hi Francisco
>
> Thank you for your remark.
> You're right, but it's the only procedure I found to make search
> on encrypted fields with good response times (using index) !
>
> Regarding access to the file system, our servers are in protected
> network areas. few people can connect to it.
>
> it's not the best solution, but we have data encryption needs and
> good performance needs too. I do not know how to do it except the specified
> procedure..
> if anyone has any proposals to put this in place, I'm interested.
>
> Thanks in advance
>
> Best Regards
> Didier ROS
>
> -Original Message-
> From: fola...@peoplecall.com [mailto:fola...@peoplecall.com]
> Sent: Sunday, October 7, 2018 17:58
> To: ROS Didier 
> Cc: pavel.steh...@gmail.com; pgsql-...@lists.postgresql.org;
> pgsql-performance@lists.postgresql.org; pgsql-gene...@lists.postgresql.org
> Subject: Re: Why the index is not used ?
>
> ROS:
>
> On Sun, Oct 7, 2018 at 3:13 PM, ROS Didier  wrote:
> 
> > -INSERT INTO cartedecredit(username,cc)  SELECT 'individu ' ||
> x.id, pgp_sym_encrypt('test value ' || x.id,
> 'motdepasse','compress-algo=2, cipher-algo=aes256') FROM
> generate_series(1,10) AS x(id);
> > -CREATE INDEX idx_cartedecredit_cc02 ON
> cartedecredit(pgp_sym_decrypt(cc, 'motdepasse','compress-algo=2,
> cipher-algo=aes256'));
>
> If my french is not too rusty you are encrypting a credit-card, and then
> storing an UNENCRYPTED copy in the index. So, getting it from the server is
> trivial for anyone with filesystem access.
>
> Francisco Olarte.
>
>


Re: Why the index is not used ?

2018-10-06 Thread Vladimir Ryabtsev
Hello Didier,

Regarding (3) and (5): to find the match, you decrypt the whole table, and
apparently this takes quite a long time.
An index cannot help here because indexes work on an exact match of type and
value, but you compare the decrypted (mapped) value, not the indexed one. A
functional index would help, but, as was said, it goes against the idea of
encrypted storage.

Regarding (6): I have never used pgp_sym_encrypt(), but I see that in the
INSERT INTO you supplied the additional parameter 'compress-algo=2,
cipher-algo=aes256', while in (6) you did not. Probably this is the reason.

In general, matching an indexed bytea column should use the index; you can
verify this by populating the column unencrypted and using 'test value
32'::bytea for the match.
In your case, I believe pgp_sym_encrypt() is not marked as STABLE or
IMMUTABLE, which is why it is evaluated for each row (very inefficient)
and cannot use the index. From the documentation:

"Since an index scan will evaluate the comparison value only once, not once
at each row, it is not valid to use a VOLATILE function in an index scan
condition."
https://www.postgresql.org/docs/10/static/xfunc-volatility.html
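
As a side note, one can check how a function is currently marked (a quick
sketch; provolatile is 'i' for immutable, 's' for stable, 'v' for volatile):

SELECT proname, provolatile
FROM pg_proc
WHERE proname = 'pgp_sym_encrypt';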

If you cannot add STABLE/IMMUTABLE to pgp_sym_encrypt() (which apparently
should be there), you can encrypt the searched value as a separate operation
and then search the table using a basic value match.

Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-28 Thread Vladimir Ryabtsev
> Does your LVM have readahead
> ramped up ?  Try lvchange -r 65536 data/postgres (or similar).

Changed this from 256 to 65536.
If it is supposed to take effect immediately (no server reboot or other
changes), then I've got no change in performance. None at all.

Vlad


Re: SELECT statement returns in 10seconds, but INSERT/CREATE TABLE AS with same SELECT takes 7 minutes

2018-09-27 Thread Vladimir Ryabtsev
> The reason for the plan shape difference is probably that the bare SELECT
> is allowed to use parallelism while INSERT/SELECT isn't.
If parallelism were used, shouldn't it be reported in the plan as something
like "Workers Planned: N"?

Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-26 Thread Vladimir Ryabtsev
> Since you have a very big toast table, given you are using spinning
disks, I think that increasing the block size will bring benefits.
But will it worsen caching? I will have fewer slots in the cache. Also, will
it affect the required storage space?

>> consecutive runs with SAME parameters do NOT hit the disk, only the
first one does, consequent ones read only from buffer cache.
> I m  a bit confused.. every query you pasted contains 'read':
>Buffers: shared hit=50 read=2378
> and 'read' means you are reading from disk (or OS cache). Or not?
Yes, sorry, it was just my misunderstanding of what "consecutive" means. To
make it clear: I iterate over all the data in the table with one query and
different parameters on each iteration (e.g. shifting both borders by 5000);
in this case I get disk reads on each query run (many more reads in the
"slow" range). But if I request data from an area queried previously, it
reads from the cache and does not hit the disk (in both ranges). E.g.
iterating over 1M records with an empty cache takes ~11 minutes in the "fast"
range and ~1 hour in the "slow" range, while the second time it takes only
~2 minutes for both ranges (if I don't do drop_caches).

Regards,
Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-25 Thread Vladimir Ryabtsev
> 1) Which file system are you using?
From Linux's view it's ext4. The real vmdx file on Hyper-V is stored on NTFS,
as far as I know.

> 2) What is the segment layout of the LVM PVs and LVs?
I am a bit lost with it. Is that what you are asking about?
master:
# pvs --segments
  PV VGFmt  Attr PSize  PFree  Start SSize
  /dev/sda5  ubuntu-vg lvm2 a--  19.76g 20.00m 0  4926
  /dev/sda5  ubuntu-vg lvm2 a--  19.76g 20.00m  4926   127
  /dev/sda5  ubuntu-vg lvm2 a--  19.76g 20.00m  5053 5
# lvs --segments
  LV VGAttr  #Str Type   SSize
  root   ubuntu-vg -wi-ao---1 linear  19.24g
  swap_1 ubuntu-vg -wi-ao---1 linear 508.00m

slave:
# pvs --segments
  PV VG  Fmt  Attr PSize   PFree Start SSize
  /dev/sda3  postgresnlpslave-vg lvm2 a--  429.77g0  0 110021
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g0  0  28392
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g0  28392   2199
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g0  30591   2560
  /dev/sda5  postgresnlpslave-vg lvm2 a--  169.52g0  33151  10246
  /dev/sdb1  postgresnlpslave-vg lvm2 a--  512.00g0  0 131071
# lvs --segments
  LV VG  Attr   #Str Type   SSize
  root   postgresnlpslave-vg -wi-ao1 linear 110.91g
  root   postgresnlpslave-vg -wi-ao1 linear  40.02g
  root   postgresnlpslave-vg -wi-ao1 linear  10.00g
  root   postgresnlpslave-vg -wi-ao1 linear 429.77g
  root   postgresnlpslave-vg -wi-ao1 linear 512.00g
  swap_1 postgresnlpslave-vg -wi-ao1 linear   8.59g

> 3) Do you use LVM for any "extra" features, such as snapshots?
I don't think so, but how to check? vgs gives #SN = 0, is that it?

> 4) You can try using seekwatcher to see where on the disk the slowness is
occurring. You get a chart similar to this
http://kernel.dk/dd-md0-xfs-pdflush.png
> 5) BCC is a collection of tools that might shed a light on what is
happening. https://github.com/iovisor/bcc
Will look into it.

Regards,
Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-24 Thread Vladimir Ryabtsev
> If it does an index scan, I think that will badly fail to keep the same
order of heap TIDs - it'll be inserting rows in ID order rather than in (I
guess) reverse ID order.
According to the plan, it's going to be a seq scan with a filter.

Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-24 Thread Vladimir Ryabtsev
> did you try either 1) forcing a bitmap scan (of only one index), to force
the heap reads to be ordered, if not sequential?  SET enable_indexscan=off
(and maybe SET enable_seqscan=off and others as needed).
Disabling index scans made it a bitmap scan.
Surprisingly, this increased read speed in both ranges: it became two times
faster for the "fast" range and three times faster for the "slow" range (for
certain segments of data I checked on; the whole experiment takes a while,
though).
But there is still a difference between the ranges: it is now ~20 MB/s
vs ~6 MB/s.
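
Roughly how the bitmap plan can be forced for such a test (a sketch,
following the suggestion quoted above; range bounds substituted as in the
EXPLAIN queries elsewhere in the thread):

SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*), sum(length(content::text))
FROM articles
WHERE article_id BETWEEN %s AND %s;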

Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-24 Thread Vladimir Ryabtsev
> This seems significant..it means the heap was probably written in
backwards
order relative to the IDs, and the OS readahead is ineffective when index
scanning across a range of IDs.
But again, why is it different for one range and another? The write order was
reversed for both ranges.

> I would definitely want to make that a partitioned table
Yes, I believe it will be partitioned in the future.

> I *suspect* VACUUM FULL won't help, since (AIUI) it copies all "visible"
tuples from the source table into a new table (and updates indices as
necessary).  It can resolve bloat due to historic DELETEs, but since I
think your table was written in reverse order of pkey, I think it'll also
copy it in reverse order.
I am going to copy the slow range into a table nearby and see if the issue
reproduces (I hope "INSERT INTO t2 SELECT * FROM t1 WHERE ..." will keep the
existing order of rows). Then I could try the same after CLUSTER.

Regards,
Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-24 Thread Vladimir Ryabtsev
> You can create 2 partial indexes and the planner will pick it up for you.
(and the planning time will go a bit up).
Created two partial indexes and ensured the planner uses them. But the result
is still the same, no noticeable difference.
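
A sketch of what such partial indexes could look like (the index names and
the boundary value separating the two ID ranges are hypothetical):

CREATE INDEX articles_id_small ON articles (article_id)
WHERE article_id < 1000000000;   -- hypothetical boundary
CREATE INDEX articles_id_big ON articles (article_id)
WHERE article_id >= 1000000000;  -- hypothetical boundary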

> it is not unusual to have 1GB cache or more...  and do not forget to drop
the cache between tests + do a sync
I conducted several long runs of dd, so I am sure these numbers are fairly
accurate. However, what worries me is that I test sequential read speed,
while during my experiments Postgres might need to read from random places,
thus reducing the real read speed dramatically. I have a feeling that this
could be the reason.
I also reviewed the import scripts and found the import was done in
DESCENDING order of IDs. This was done to get the most recent records sooner;
maybe it caused some inefficiency in the storage... But again, it was this
way for both ranges.

> - how big is your index?
pg_table_size('articles_pkey') = 1561 MB

> - how big is the table?
pg_table_size('articles') = 427 GB
pg_table_size('pg_toast.pg_toast_221558') = 359 GB

> - given the size of shared_buffers, almost 2M blocks should fit, but you
say 2 consecutive runs still are hitting the disk. That's strange indeed
since you are using way more than 2M blocks.
TBH, I cannot say I understand your calculations with the number of blocks...
But to clarify: consecutive runs with the SAME parameters do NOT hit the
disk; only the first one does, and subsequent ones read only from the buffer
cache.

> Did you check that perhaps are there any other processes or cronjobs (on
postgres and on the system) that are maybe reading data and flushing out
the cache?
I checked with iotop that nothing else reads intensively from any disk in
the system. And again, the result is 100% reproducible and depends only on
the ID range; if there were anything like that, I would have noticed some
fluctuations in the results.

> You can make use of pg_buffercache in order to see what is actually
cached.
It seems that there is no such view in my DB; could it be that the extension
is not installed?
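
For reference, pg_buffercache is a contrib extension that has to be created
per database; a sketch of installing it and summarizing what is cached
(assuming sufficient privileges):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Top relations by number of cached buffers in the current database.
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;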

> - As Laurenz suggested (VACUUM FULL), you might want to move data around.
You can try also a dump + restore to narrow the problem to data or disk
I launched VACUUM FULL, but it ran very slowly; according to my calculation
it might take 17 hours. I will try to copy the data into another table with
the same structure, or spin up another server, and let you know.

> - You might also want to try to see the disk graph of Windows, while you
are running your tests. It can show you if data (and good to know how much)
is actually fetching from disk or not.
I wanted to do so, but I don't have access to the Hyper-V server; I will try
to request credentials from the admins.

A couple more observations:
1) The result of my experiment is almost unaffected by other server load.
Another user was running a query (over this table) with a read speed of ~130
MB/s, while my query read at 1.8-2 MB/s.
2) iotop shows a higher IO% (~93-94%) with the slower read speed (though it
is not quite clear what this field means). The process from the example above
had ~55% IO at 130 MB/s, while my process had ~93% at ~2 MB/s.

Regards,
Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-24 Thread Vladimir Ryabtsev
> Another idea is that the operating system rearranges I/O in a way that
is not ideal for your storage.
> Try a different I/O scheduler by running
echo deadline > /sys/block/sda/queue/scheduler

My scheduler was already "deadline".
In some places I read that in virtual environments the "noop" scheduler is
sometimes better, so I tried it. However, the experiment showed NO noticeable
difference between them (see "deadline": https://i.stack.imgur.com/wCOJW.png,
"noop": https://i.stack.imgur.com/lB33u.png). At the same time, the tests
show almost identical patterns of read-speed change when going over the
"slow" range.

Vlad

Thu, Sep 20, 2018 at 20:17, Laurenz Albe :

> Vladimir Ryabtsev wrote:
> > explain (analyze, buffers)
> > select count(*), sum(length(content::text)) from articles where
> article_id between %s and %s
> >
> > Sample output:
> >
> > Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual
> time=6625.993..6625.995 rows=1 loops=1)
> >   Buffers: shared hit=26847 read=3914
> >   ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35
> rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
> > Index Cond: ((article_id >= 43800) AND (article_id <=
> 438005000))
> > Buffers: shared hit=4342 read=671
> > Planning time: 0.393 ms
> > Execution time: 6626.136 ms
> >
> > Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual
> time=33219.100..33219.102 rows=1 loops=1)
> >   Buffers: shared hit=6568 read=7104
> >   ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96
> rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
> > Index Cond: ((article_id >= '10002100'::bigint) AND
> (article_id <= '10002101'::bigint))
> > Buffers: shared hit=50 read=2378
> > Planning time: 0.517 ms
> > Execution time: 33219.218 ms
> >
> > During iteration, I parse the result of EXPLAIN and collect series of
> following metrics:
> >
> > - buffer hits/reads for the table,
> > - buffer hits/reads for the index,
> > - number of rows (from "Index Scan..."),
> > - duration of execution.
> >
> > Based on metrics above I calculate inherited metrics:
> >
> > - disk read rate: (index reads + table reads) * 8192 / duration,
> > - reads ratio: (index reads + table reads) / (index reads + table reads
> + index hits + table hits),
> > - data rate: (index reads + table reads + index hits + table hits) *
> 8192 / duration,
> > - rows rate: number of rows / duration.
> >
> > Since "density" of IDs is different in "small" and "big" ranges, I
> adjusted
> > size of chunks in order to get around 5000 rows on each iteration in
> both cases,
> > though my experiments show that chunk size does not really matter a lot.
> >
> > The issue posted at the very beginning of my message was confirmed for
> the
> > *whole* first and second ranges (so it was not just caused by randomly
> cached data).
> >
> > To eliminate cache influence, I restarted Postgres server with flushing
> buffers:
> >
> > /$ postgresql stop; sync; echo 3 > /proc/sys/vm/drop_caches; postgresql
> start
> >
> > After this I repeated the test and got next-to-same picture.
> >
> > "Small' range: disk read rate is around 10-11 MB/s uniformly across the
> test.
> > Output rate was 1300-1700 rows/s. Read ratio is around 13% (why?
> Shouldn't it be
> > ~ 100% after drop_caches?).
> > "Big" range: In most of time disk read speed was about 2 MB/s but
> sometimes
> > it jumped to 26-30 MB/s. Output rate was 70-80 rows/s (but varied a lot
> and
> > reached 8000 rows/s). Read ratio also varied a lot.
> >
> > I rendered series from the last test into charts:
> > "Small" range: https://i.stack.imgur.com/3Zfml.png
> > "Big" range (insane): https://i.stack.imgur.com/VXdID.png
> >
> > During the tests I verified disk read speed with iotop and found its
> indications
> > very close to ones calculated by me based on EXPLAIN BUFFERS. I cannot
> say I was
> > monitoring it all the time, but I confirmed it when it was 2 MB/s and 22
> MB/s on
> > the second range and 10 MB/s on the first range. I also checked with
> htop that
> > CPU was not a bottleneck and was around 3% during the tests.
> >
> > The issue is reproducible on both master and slave servers. My tests
> were conducted
> > on slave, while there were no any other load on DBMS, or disk activity
> on the
> 

Re: Why could different data in a table be processed with different performance?

2018-09-22 Thread Vladimir Ryabtsev
> is the length of the text equally distributed over the 2 partitions?
Not 100% equally, but to me it does not seem to be a big deal...
Considering the ranges independently:
First range: ~70% < 10 KB, ~25% for 10-20 KB, ~3% for 20-30 KB, everything
else is less than 1% (with 10 KB steps).
Second range: ~80% < 10 KB, ~18% for 10-20 KB, ~2% for 20-30 KB, everything
else is less than 1% (with 10 KB steps).

> From what you posted, the first query retrieves 5005 rows, but the second
2416. It might be helpful if we are able to compare 5000 vs 5000
Yes, it was just an example; here are the plans for approximately the same
number of rows:

Aggregate  (cost=9210.12..9210.13 rows=1 width=16) (actual time=4265.478..4265.479 rows=1 loops=1)
  Buffers: shared hit=27027 read=4311
  I/O Timings: read=2738.728
  ->  Index Scan using articles_pkey on articles  (cost=0.57..9143.40 rows=5338 width=107) (actual time=12.254..873.081 rows=5001 loops=1)
        Index Cond: ((article_id >= 43803) AND (article_id <= 438035000))
        Buffers: shared hit=4282 read=710
        I/O Timings: read=852.547
Planning time: 0.235 ms
Execution time: 4265.554 ms

Aggregate  (cost=11794.59..11794.60 rows=1 width=16) (actual time=62298.559..62298.559 rows=1 loops=1)
  Buffers: shared hit=15071 read=14847
  I/O Timings: read=60703.859
  ->  Index Scan using articles_pkey on articles  (cost=0.57..11709.13 rows=6837 width=107) (actual time=24.686..24582.221 rows=5417 loops=1)
        Index Cond: ((article_id >= '10002104'::bigint) AND (article_id <= '10002106'::bigint))
        Buffers: shared hit=195 read=5244
        I/O Timings: read=24507.621
Planning time: 0.494 ms
Execution time: 62298.630 ms

If we subtract I/O from the total time, we get 1527 ms vs 1596 ms, which is
very close for everything other than I/O (considering the slightly higher
number of rows in the second case). But the I/O time differs dramatically.

> Also is worth noticing that the 'estimated' differs from 'actual' on the
second query. I think that happens because data is differently distributed
over the ranges. Probably the analyzer does not have enough samples to
understand the real distribution.
I think we should not worry about it unless the planner chose a poor plan,
should we? Statistics affect the choice of a proper plan, but not the
execution of the plan, don't they?

> You might try to increase the number of samples (and run analyze)
To be honest, I don't understand it... As far as I know, in Postgres we have
two options: set the per-column statistics target and set n_distinct. We
can't increase the fraction of rows analyzed (in other DBMSs we can set the
ANALYZE percentage explicitly). Moreover, in our case the problem column is a
PRIMARY KEY with all distinct values. Could you point out what exactly I
should do?
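
For reference, the per-column statistics target mentioned above is raised
like this (a sketch):

ALTER TABLE articles ALTER COLUMN article_id SET STATISTICS 1000;
ANALYZE articles;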

> or to create partial indexes on the 2 ranges.
Sure, I will try it with partial indexes. Should I drop the existing PK
index, or is ensuring that the planner picks the range index enough?

> i would do a sync at the end, after dropping caches.
A bit off-topic, but why? Doing sync may put something to cache again.
https://linux-mm.org/Drop_Caches
https://unix.stackexchange.com/a/82164/309344

> - does the raid controller have a cache?
> - how big is the cache? (when you measure disk speed, that will influence
the result very much, if  you do not run the test on big-enough data chunk)
best if is disabled during your tests
I am pretty sure there is some cache, usually several tens of megabytes, but
I ran the disk read tests several times with chunks that could not fit in the
cache and with random offsets, so I am pretty sure that the figure of around
500 MB/s is reasonably accurate (but it is only for sequential reads).

> - is the OS caching disk blocks too? maybe you want to drop everything
from there too.
How can I find that out? And how do I drop it? Or do you mean the hypervisor
OS? Anyway, don't you think that caching specifics can hardly explain these
issues?

> I think that you should be pragmatic and try to run the tests on a
physical machine.
I wish I could, but it is hardly possible. In the future we may migrate the
DB to physical hosts, but for now we need to make it work in a virtual
environment.

> on the VM or on the physical host?
On the VM. The physical host is Windows (no iotop) and I have no access to
it.

Vlad


Re: Why could different data in a table be processed with different performance?

2018-09-20 Thread Vladimir Ryabtsev
> Was the data populated differently, too ?
Here is how new records were coming in over the last two months, by day:
https://i.stack.imgur.com/zp9WP.png During a day, records come in evenly (in
both ranges), slightly faster during European and American working hours.

Since Jul 1, 2018, when we started populating with online records, the trend
was approximately the same as before Aug 04, 2018 (see picture). Then it
changed for the "big" range; we are now in a transition period until it
stabilizes.

We also imported historical data in bulk from another system. The first part
was the range with big numbers; they were added in a couple of days. The
second part was the range with small numbers; it took around a week. Online
records kept coming in uninterruptedly during the import.

Rows are updated rarely and almost never deleted.

Here is the distribution of the JSONB field length (converted to ::text) over
the last 5 days:
<10KB: 665066
10-20KB: 225697
20-30KB: 25640
30-40KB: 6678
40-50KB: 2100
50-60KB: 1028
Other (max 2.7MB): 2248 (only isolated examples larger than 250KB)

> Has the table been reindexed (or pg_repack'ed) since loading (or vacuumed
for that matter) ?
Not sure what you mean... We created indexes on some fields (appended_at,
published_at, source_id).
When I came across the problem I noticed that the table was not being
vacuumed. I then ran VACUUM ANALYZE manually, but it did not change anything
about the issue.

> Were the tests run when the DB was otherwise idle?
Yes, like I said, my tests were performed on the slave; there were no other
users connected (only me, monitoring sessions from pgAdmin), and I never
noticed any significant I/O from processes other than postgres (only light
load from replication).

> You can see the index scan itself takes an additional 11sec, the "heap"
portion takes the remaining, additional 14sec (33s-12s-7s).
Sorry, I see 33 s total and 12 s for index, where do you see 7 s?

> I guess you mean buffers cache hit ratio: read/hit, which I think should
actually be read/(hit+read).
I will quote myself:
> reads ratio: (index reads + table reads) / (index reads + table reads +
index hits + table hits)
So yes, you are right, it is.

+ Some extra info about my system from QA recommendations:

OS version: Ubuntu 16.04.2 LTS / xenial

~$ time dd if=/dev/mapper/postgresnlpslave--vg-root of=/dev/null bs=1M
count=32K skip=$((128*$RANDOM/32))
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 62.1574 s, 553 MB/s
0.05user 23.13system 1:02.15elapsed 37%CPU (0avgtext+0avgdata
3004maxresident)k
67099496inputs+0outputs (0major+335minor)pagefaults 0swaps

DBMS is accessed directly (no pgpool, pgbouncer, etc).

RAM: 58972 MB

On physical device level RAID10 is used.

Table metadata: (relname, relpages, reltuples, relallvisible, relkind,
relnatts, relhassubclass, reloptions, pg_table_size(oid)) = (articles,
7824944, 6.74338e+07, 7635864, 10, false, 454570926080)

Regards,
Vlad

Thu, Sep 20, 2018 at 17:42, Justin Pryzby :

> On Thu, Sep 20, 2018 at 05:07:21PM -0700, Vladimir Ryabtsev wrote:
> > I am experiencing a strange performance problem when accessing JSONB
> > content by primary key.
>
> > I noticed that with some IDs it works pretty fast while with other it is
> > 4-5 times slower. It is suitable to note, there are two main 'categories'
> > of IDs in this table: first is range 27000-5, and second is
> > range 100-10003000. For the first range it is 'fast' and for
> > the second it is 'slow'.
>
> Was the data populated differently, too ?
> Has the table been reindexed (or pg_repack'ed) since loading (or vacuumed
> for
> that matter) ?
> Were the tests run when the DB was otherwise idle?
>
> You can see the index scan itself takes an additional 11sec, the "heap"
> portion
> takes the remaining, additional 14sec (33s-12s-7s).
>
> So it seems to me like the index itself is slow to scan.  *And*, the heap
> referenced by the index is slow to scan, probably due to being referenced
> by
> the index less consecutively.
>
> > "Small' range: disk read rate is around 10-11 MB/s uniformly across the
> > test. Output rate was 1300-1700 rows/s. Read ratio is around 13% (why?
> > Shouldn't it be ~ 100% after drop_caches?).
>
> I guess you mean buffers cache hit ratio: read/hit, which I think should
> actually be read/(hit+read).
>
> It's because a given buffer can be requested multiple times.  For example,
> if
> an index page is read which references multiple items on the same heap
> page,
> each heap access is counted separately.  If the index is freshly built,
> that'd
> happen nearly every item.
>
> Justin
>
> > Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual
> time=6625.993..6625.995 rows=1 loops=1)
> >   Buffers: shared hit=26847 read=3914
> >   ->  Ind

Why could different data in a table be processed with different performance?

2018-09-20 Thread Vladimir Ryabtsev
I am experiencing a strange performance problem when accessing JSONB
content by primary key.

My DB version() is PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg14.04+1) on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4,
64-bit
postgres.conf: https://justpaste.it/6pzz1
uname -a: Linux postgresnlpslave 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18
14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
The machine is virtual, running under Hyper-V.
Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 1x1 cores
Disk storage: the host has two vmdx drives, first shared between the root
partition and an LVM PV, second is a single LVM PV. Both PVs are in a VG
containing swap and postgres data partitions. The data is mostly on the
first PV.

I have such a table:

CREATE TABLE articles
(
article_id bigint NOT NULL,
content jsonb NOT NULL,
published_at timestamp without time zone NOT NULL,
appended_at timestamp without time zone NOT NULL,
source_id integer NOT NULL,
language character varying(2) NOT NULL,
title text NOT NULL,
topics topic[] NOT NULL,
objects object[] NOT NULL,
cluster_id bigint NOT NULL,
CONSTRAINT articles_pkey PRIMARY KEY (article_id)
)

We have a Python library (using the psycopg2 driver) to access this table. It
executes simple queries against the table; one of them is used for bulk
downloading of content and looks like this:

select content from articles where article_id between $1 and $2

I noticed that with some IDs it works pretty fast, while with others it is
4-5 times slower. It is worth noting that there are two main 'categories' of
IDs in this table: the first is the range 27000-5, and the second is
the range 100-10003000. For the first range it is 'fast', and for
the second it is 'slow'. Besides the larger absolute numbers pushing them
from int to bigint, values in the second range are more 'sparse': in the
first range values are almost consecutive (with very few 'holes' of missing
values), while in the second range there are many more 'holes' (average
filling is 35%). Total number of rows in the first range: ~62M, in the second
range: ~10M.

I conducted several experiments to eliminate the possible influence of the
library's code and network throughput; I omit some of them here. I ended up
iterating over the table with EXPLAIN to simulate the read load:

explain (analyze, buffers)
select count(*), sum(length(content::text)) from articles where article_id
between %s and %s

Sample output:

Aggregate  (cost=8635.91..8635.92 rows=1 width=16) (actual time=6625.993..6625.995 rows=1 loops=1)
  Buffers: shared hit=26847 read=3914
  ->  Index Scan using articles_pkey on articles  (cost=0.57..8573.35 rows=5005 width=107) (actual time=21.649..1128.004 rows=5000 loops=1)
        Index Cond: ((article_id >= 43800) AND (article_id <= 438005000))
        Buffers: shared hit=4342 read=671
Planning time: 0.393 ms
Execution time: 6626.136 ms

Aggregate  (cost=5533.02..5533.03 rows=1 width=16) (actual time=33219.100..33219.102 rows=1 loops=1)
  Buffers: shared hit=6568 read=7104
  ->  Index Scan using articles_pkey on articles  (cost=0.57..5492.96 rows=3205 width=107) (actual time=22.167..12082.624 rows=2416 loops=1)
        Index Cond: ((article_id >= '10002100'::bigint) AND (article_id <= '10002101'::bigint))
        Buffers: shared hit=50 read=2378
Planning time: 0.517 ms
Execution time: 33219.218 ms

During iteration, I parse the result of EXPLAIN and collect series of
following metrics:

- buffer hits/reads for the table,
- buffer hits/reads for the index,
- number of rows (from "Index Scan..."),
- duration of execution.

Based on the metrics above I calculate derived metrics:

- disk read rate: (index reads + table reads) * 8192 / duration,
- reads ratio: (index reads + table reads) / (index reads + table reads +
index hits + table hits),
- data rate: (index reads + table reads + index hits + table hits) * 8192 /
duration,
- rows rate: number of rows / duration.

Since "density" of IDs is different in "small" and "big" ranges, I adjusted
size of chunks in order to get around 5000 rows on each iteration in both
cases, though my experiments show that chunk size does not really matter a
lot.

The issue posted at the very beginning of my message was confirmed for the
*whole* first and second ranges (so it was not just caused by randomly
cached data).

To eliminate cache influence, I restarted Postgres server with flushing
buffers:

/$ postgresql stop; sync; echo 3 > /proc/sys/vm/drop_caches; postgresql
start

After this I repeated the test and got next-to-same picture.

"Small' range: disk read rate is around 10-11 MB/s uniformly across the
test. Output rate was 1300-1700 rows/s. Read ratio is around 13% (why?
Shouldn't it be ~ 100% after drop_caches?).
"Big" range: In most of time disk read speed was about 2 MB/s but sometimes
it jumped to 26-30 MB/s. Output rate was 70-80 rows/s (but varied a lot and
reached 8000 rows/s). Read ratio also varied a lot.

I