https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113504

            Bug ID: 113504
           Summary: High memory usage for parallel `std::sort`
           Product: gcc
           Version: 12.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ruben.laso at tuwien dot ac.at
  Target Milestone: ---

The memory usage of parallel `std::sort` is very high compared to the
sequential version and even other parallel implementations.
The attached code is a simple test case to compare the memory usage of parallel
`std::sort`, `tbb::parallel_sort` and sequential `std::sort`.
The test case has been replicated in several systems with versions of GCC 10,
11 and 12.

An example of the results (and max. memory usage according to `/usr/bin/time`)
is shown in the following table:

| Executable         | Size        | Time     | Max Resident Memory |
| ------------------ | ----------- | -------- | ------------------- |
| ./pstl_sort.out    | 33554432    | 0:00.23  | 423776k             |
| ./tbb_sort.out     | 33554432    | 0:00.44  | 143952k             |
| ./seq_sort.out     | 33554432    | 0:03.32  | 134836k             |
| ./pstl_sort.out    | 1073741824  | 0:05.68  | 14236656k           |
| ./tbb_sort.out     | 1073741824  | 0:13.02  | 4207680k            |
| ./seq_sort.out     | 1073741824  | 2:07.38  | 4198124k            |

In the example, the parallel `std::sort` (pstl_sort) uses ~3 times more memory
than the `tbb::parallel_sort` (tbb_sort) and the sequential `std::sort`
(seq_sort).
It also runs faster, though.

System specs in the example:
CPU: AMD EPYC 7551
RAM: 256 GB DDR4
OS: Debian 10.10

Compilation with:
g++ -std=c++17 -O3 -pedantic -Wall -Wextra -Werror -o pstl_sort.out main.cpp
-ltbb -DPSTL_SORT
g++ -std=c++17 -O3 -pedantic -Wall -Wextra -Werror -o tbb_sort.out main.cpp
-ltbb -DTBB_SORT
g++ -std=c++17 -O3 -pedantic -Wall -Wextra -Werror -o seq_sort.out main.cpp
-ltbb

Did I miss something in the code?
Is that high memory usage a deliberate trade-off for performance?
Is the algorithm still in development to improve memory usage?

Reply via email to