Re: insert and query performance on big string table with pg_trgm

2017-11-25 Thread Gábor SZŰCS
Don't know if it would make PostgreSQL happier but how about adding a hash
value column and creating the unique index on that one? May block some
false duplicates but the unique index would be way smaller, speeding up
inserts.

2017. nov. 25. 7:35 ezt írta ("Jeff Janes" ):

>
>
> On Nov 21, 2017 00:05, "Matthew Hall"  wrote:
>
>
> > Are all indexes present at the time you insert?  It will probably be
> much faster to insert without the gin index (at least) and build it after
> the load.
>
> There is some flexibility on the initial load, but the updates in the
> future will require the de-duplication capability. I'm willing to accept
> that might be somewhat slower on the load process, to get the accurate
> updates, provided we could try meeting the read-side goal I wrote about, or
> at least figure out why it's impossible, so I can understand what I need to
> fix to make it possible.
>
>
> As long as you don't let anyone use the table between the initial load and
> when the index build finishes, you don't have to compromise on
> correctness.  But yeah, makes sense to worry about query speed first.
>
>
>
>
>
>
> > If you repeat the same query, is it then faster, or is it still slow?
>
> If you keep the expression exactly the same, it still takes a few seconds
> as could be expected for such a torture test query, but it's still WAY
> faster than the first such query. If you change it out to a different
> expression, it's longer again of course. There does seem to be a
> low-to-medium correlation between the number of rows found and the query
> completion time.
>
>
> To make this quick, you will need to get most of the table and most of the
> index cached into RAM.  A good way to do that is with pg_prewarm.  Of
> course that only works if you have enough RAM in the first place.
>
> What is the size of the table and the gin index?
>
>
> Cheers,
>
> Jeff
>
>


Re: insert and query performance on big string table with pg_trgm

2017-11-24 Thread Jeff Janes
On Nov 21, 2017 00:05, "Matthew Hall"  wrote:


> Are all indexes present at the time you insert?  It will probably be much
faster to insert without the gin index (at least) and build it after the
load.

There is some flexibility on the initial load, but the updates in the
future will require the de-duplication capability. I'm willing to accept
that might be somewhat slower on the load process, to get the accurate
updates, provided we could try meeting the read-side goal I wrote about, or
at least figure out why it's impossible, so I can understand what I need to
fix to make it possible.


As long as you don't let anyone use the table between the initial load and
when the index build finishes, you don't have to compromise on
correctness.  But yeah, makes sense to worry about query speed first.






> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds
as could be expected for such a torture test query, but it's still WAY
faster than the first such query. If you change it out to a different
expression, it's longer again of course. There does seem to be a
low-to-medium correlation between the number of rows found and the query
completion time.


To make this quick, you will need to get most of the table and most of the
index cached into RAM.  A good way to do that is with pg_prewarm.  Of
course that only works if you have enough RAM in the first place.

What is the size of the table and the gin index?


Cheers,

Jeff


Re: insert and query performance on big string table with pg_trgm

2017-11-21 Thread Matthew Hall
Hi Jeff,

Thanks so much for writing. You've got some great points.

> On Nov 20, 2017, at 5:42 PM, Jeff Janes  wrote:
> While I have not done exhaustive testing, from the tests I have done I've 
> never found gist to be better than gin with trgm indexes.

Thanks, this helps considerably, as the documentation was kind of confusing and 
I didn't want to get it wrong if I could avoid it.

> Do you really need the artificial primary key, when you already have another 
> column that would be used as the primary key?  If you need to use this it a 
> foreign key in another type, then very well might.  But maintaining two 
> unique indexes doesn't come free.

OK, fair enough, I'll test with it removed and see what happens.

> Are all indexes present at the time you insert?  It will probably be much 
> faster to insert without the gin index (at least) and build it after the load.

There is some flexibility on the initial load, but the updates in the future 
will require the de-duplication capability. I'm willing to accept that might be 
somewhat slower on the load process, to get the accurate updates, provided we 
could try meeting the read-side goal I wrote about, or at least figure out why 
it's impossible, so I can understand what I need to fix to make it possible.

> Without knowing this key fact, it is hard to interpret the rest of your data.

I'm assuming you're referring to the part about the need for the primary key, 
and the indexes during loading? I did try to describe that in the earlier mail, 
but obviously I'm new at writing these, so sorry if I didn't make it more 
clear. I can get rid of the bigserial PK and the indexes could be made 
separately, but I would need a way to de-duplicate on future reloading... 
that's why I had the ON CONFLICT DO NOTHING expression on the INSERT. So we'd 
still want to learn why the INSERT is slow to fix up the update processes that 
would happen in the future.

> * maintenance_work_mem 512 MB
> 
> Building a gin index in bulk could benefit from more memory here. 

Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I 
don't screw it up?

> * synchronous_commit off
> 
> If you already are using unlogged tables, this might not be so helpful, but 
> does increase the risk of the rest of your system.

Fixed it; the unlogged mode change came later than this did.

>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
> 29578 postgres  20   0 6575672 6.149g 6.139g R  86.0 39.7  45:24.97 postgres
> 
> You should expand the command line (by hitting 'c', at least in my version of 
> top) so we can see which postgres process this is.

Good point, I'll write back once I retry w/ your other advice.

> explain (analyze, buffers), please.  And hopefully with track_io_timing=on.

track_io_timing was missing because sadly I had only found it in one document 
at the very end of the investigation, after doing the big job which generated 
all of the material posted. It's there now, so here is some better output on 
the query:

explain (analyze, buffers) select * from huge_table where value ilike 
'%canada%';

 Bitmap Heap Scan on huge_table  (cost=273.44..61690.09 rows=16702 width=33) 
(actual time=5701.511..76469.688 rows=110166 loops=1)
   Recheck Cond: ((value)::text ~~* '%canada%'::text)
   Rows Removed by Index Recheck: 198
   Heap Blocks: exact=66657
   Buffers: shared hit=12372 read=56201 dirtied=36906
   I/O Timings: read=74195.734
   ->  Bitmap Index Scan on huge_table_value_trgm  (cost=0.00..269.26 
rows=16702 width=0) (actual time=5683.032..5683.032 rows=110468 loops=1)
 Index Cond: ((value)::text ~~* '%canada%'::text)
 Buffers: shared hit=888 read=1028
 I/O Timings: read=5470.839
 Planning time: 0.271 ms
 Execution time: 76506.949 ms

I will work some more on the insert piece.

> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds as 
could be expected for such a torture test query, but it's still WAY faster than 
the first such query. If you change it out to a different expression, it's 
longer again of course. There does seem to be a low-to-medium correlation 
between the number of rows found and the query completion time.

> Cheers,
> Jeff

Thanks,
Matthew.