Alright, so it seems like partitioning and raising `shared_buffers`, as
well as adding the `order by`, helped to a certain extent, but the writes
are still slow. Inserting a file with 1 million records is taking almost 3
minutes (way better than the 20+ minutes, but still pretty slow compared to
the 20 seconds it used to take).

The interesting thing right now is this: if I try to insert the data from
a file that has already been inserted (so every row ends up being rejected
by the unique constraint), the insert finishes in only 1 to 4 seconds. For
regular files, which usually contain about 30% new unique records (roughly
300k rows), it takes those 3 minutes.
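As an aside, a quick way to confirm how many rows actually survive the `on
conflict do nothing` for a given file is a data-modifying CTE with
`returning`. This is just a sketch reusing the table and column names from
the `insert` statement quoted at the bottom of this thread:

```
with ins as (
    insert into users_no_dups (created_ts, user_id, name, url)
    (
        select created_ts, user_id, name, url
        from users
    )
    on conflict do nothing
    returning 1
)
select count(*) as rows_inserted from ins;
```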

**UPDATE**

I started writing this email, and then something occurred to me that I
should try. I'm leaving the information above for historical reasons.

Basically, I went ahead and ran a `reindex` on all the partitions to see if
it would improve the performance, and it seems like that did it! I used the
following script to reindex all of the partitions (the names of my
partitions all start with ubp_):

```
DO $$DECLARE r record;
BEGIN
    -- reindex every index belonging to the partitions (all named ubp_*)
    FOR r IN select schemaname, indexname from pg_indexes
             where tablename like 'ubp_%'
    LOOP
        -- quote_ident protects identifiers that would otherwise need quoting
        EXECUTE 'reindex index '
            || quote_ident(r.schemaname) || '.' || quote_ident(r.indexname);
    END LOOP;
END$$;
```

After doing this, processing of each file is taking anywhere between 8 and
20 seconds (most of them seem to be taking 8 seconds, though). So, this is
great!
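In case it helps someone else later: this kind of index bloat shows up in
the index sizes, so a query like the one below (just a sketch, assuming the
same `ubp_` naming convention) can be run before and after the `reindex` to
see how much space comes back:

```
select indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where relname like 'ubp_%'
order by pg_relation_size(indexrelid) desc;
```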

In summary, what I ended up having to do was:

* Raise `shared_buffers` to 160GB
* Add an `order by` to the `select` subquery in the `insert` statement
* Partition the table
* Tune postgres configurations as shown below:

~~~
ssl = off
shared_buffers = 160GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
~~~
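Concretely, the `order by` change above means the `insert` quoted at the
bottom of this thread now looks like this (sorting by `user_id`, per Jeff's
suggestion):

```
insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from
        users
    order by
        user_id
) on conflict do nothing;
```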

I can't tell whether raising `shared_buffers` or adding the `order by` was
responsible for the performance gains; it doesn't hurt to do both anyway. I
do know for a fact that the `reindex` of each partition made a huge
difference in the end, as explained above (bringing insert time down from 3
minutes to 8 seconds).
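For anyone trying to reproduce this, the partitioning step looks roughly
like the sketch below. The real scheme derives the partitions from
`user_id` (as described in my earlier email below); the hash strategy, the
modulus, and the column types here are assumptions for illustration only:

```
-- sketch only: actual partitioning logic, bounds, and column types differ
create table users_no_dups (
    created_ts timestamp,
    user_id    bigint,
    name       text,
    url        text,
    unique (user_id)
) partition by hash (user_id);

-- partitions named with the ubp_ prefix, as in the reindex script above
create table ubp_0 partition of users_no_dups
    for values with (modulus 8, remainder 0);
-- ...and so on for remainders 1 through 7
```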

I have about 1800 files in my backlog to be processed now (about 1.8
billion records). I have started processing them and will report back in
case performance degrades once again.

Thanks everybody for the help so far! I really appreciate it.

Henrique

PS: I checked the `dirty` ratios for the OS:

```
$ sysctl vm.dirty_ratio
vm.dirty_ratio = 20

$ sysctl vm.dirty_background_ratio
vm.dirty_background_ratio = 10

$ sysctl vm.dirty_expire_centisecs
vm.dirty_expire_centisecs = 3000
```

These are the default values. If I understand them correctly, on this 256GB
machine they allow roughly 25GB of dirty pages before background writeback
starts and roughly 51GB before writes begin to block, so it seems to me
that these values should be fine.

On Mon, Jul 13, 2020 at 9:02 PM Henrique Montenegro <typ...@gmail.com>
wrote:

>
>
> On Mon, Jul 13, 2020 at 8:05 PM Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Mon, Jul 13, 2020 at 10:23 AM Henrique Montenegro <typ...@gmail.com>
>> wrote:
>>
>> insert into users_no_dups (
>>>     created_ts,
>>>     user_id,
>>>     name,
>>>     url
>>> ) (
>>>     select
>>>         created_ts,
>>>         user_id,
>>>         name,
>>>         url
>>>     from
>>>         users
>>> ) on conflict do nothing
>>>
>>
>> Once the size of the only index exceeds shared_buffers by a bit (the
>> amount of "a bit" depends on your RAM, kernel version, settings
>> for dirty_background_ratio, dirty_expire_centisecs, and probably other
>> things, and is not easy to predict) the performance falls off a cliff when
>> inserting values in a random order.  Every insert dirties a random index
>> leaf page, which quickly gets evicted from shared_buffers to make room for
>> other random leaf pages to be read in, and then turns into flush calls when
>> the kernel freaks out about the amount and age of dirty pages held in
>> memory.
>>
>
> That is interesting to know. I will do some research on those things.
>
>
>> What happens if you add an "ORDER BY user_id" to your above select?
>>
>
> I don't know. I will give it a try right now.
>
>>
>>
>>> shared_buffers = 8GB
>>> RAM: 256GB
>>>
>>
>> Or, crank up shared_buffers by a lot.  Like, beyond the size of the
>> growing index, or up to 240GB if the index ever becomes larger than that.
>> And make the time between checkpoints longer.  If the dirty buffers are
>> retained in shared_buffers longer, chances of them getting dirtied
>> repeatedly between writes is much higher than if you just toss them to the
>> kernel and hope for the best.
>>
>>
> I cranked it up to 160GB to see how it goes.
>
> Cheers,
>>
>> Jeff
>>
>
> I created the partitions, as mentioned before. I was able to partition
> the table based on the user_id (found some logic to it). I was
> transferring the data from the original table (about 280 million records;
> 320GB) to the new partitioned table, and things were going well, with
> write speeds between 30MB/s and 50MB/s. After reading 270GB of the 320GB
> (in 4 and a half hours) and writing it to the new partitioned table, the
> write speed went down to 7KB/s. It is so frustrating.
>
> I will keep the partitions and try your suggestions to see how it goes.
>
> I apologize for the long time between replies; it is just that testing
> this stuff takes 4+ hours per run.
>
> If there are any other suggestions of things for me to look into in the
> meantime as well, please keep them coming.
>
> Thanks!
>
> Henrique
>
