Dear deal.II community,

I have implemented a multiscale finite element problem using deal.II. It 
involves PDEs on different levels, namely a macroscopic/effective PDE and 
so-called "cell problems" (not to be confused with the cells of the grid), 
which have to be solved at each quadrature point of the macroscopic grid in 
order to compute the effective coefficient of the macroscopic PDE.

I am trying to parallelize parts of the code using the WorkStream class, in 
particular the solution of the cell problems. Unfortunately, I observe 
suboptimal parallel efficiency when I compare the runtimes for different 
numbers of threads: as an example, the efficiency is only approx. 75% when 
I go from one thread to two threads via 
MultithreadInfo::set_thread_limit(). I should mention that I only compare 
the runtimes of the parallelized part of the program, i.e. I would expect 
an efficiency close to 100%.
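
(By efficiency I mean the usual T_1 / (N * T_N), measured only over the 
parallelized section; 75% for N = 2 threads thus corresponds to a speedup 
of only 1.5 instead of 2.)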

Now I was able to trace the efficiency problem back to the following issue: 
when assembling the system matrix for a cell problem (which has to be done 
for each quadrature point of the macroscopic grid), I measure the following 
wall times for one call to scratch.fe_values.reinit(cell):
- one thread: ~1.5e-6 s,
- two threads: ~5e-6 s.
If the time spent on the task (i.e. the local assembly) on a cell is 
comparatively small, the parallel efficiency suffers.
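
This is not my exact instrumentation, but essentially I time the single 
call inside the worker function like this (the variable names are just for 
illustration) and average over all calls:

  const auto t0 = std::chrono::steady_clock::now();
  scratch.fe_values.reinit(cell);
  const auto t1 = std::chrono::steady_clock::now();
  const double time_per_call =
    std::chrono::duration<double>(t1 - t0).count();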

However, the problem vanishes (efficiency back at ~100%) if the time spent 
on one cell during assembly is increased, e.g. by using higher-order 
finite elements or by simply putting the thread to sleep for a sufficiently 
long time. Neither is a real option in my application.

I have also attached an MWE. A similar behavior can be observed in tutorial 
step-9 when the FE order in line 259 is lowered from 5 to 1.
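
If I am not mistaken, that line is the element order in the 
AdvectionProblem constructor initializer list, i.e. change

  , fe(5)

to

  , fe(1)

and compare the multithreaded runtimes.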

My questions are:
- What exactly is the reason for this behavior of the FEValues::reinit() 
function, or how could I find that out myself?
- Is there anything I can do to circumvent this problem?

Thank you in advance, any help/hint is highly appreciated.

Best,
Jonas

#include <deal.II/grid/tria.h>
#include <deal.II/dofs/dof_handler.h>
#include <deal.II/grid/grid_generator.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/dofs/dof_tools.h>
#include <deal.II/fe/fe_values.h>
#include <deal.II/base/quadrature_lib.h>
#include <deal.II/base/function.h>
#include <deal.II/numerics/vector_tools.h>
#include <deal.II/numerics/matrix_tools.h>
#include <deal.II/lac/vector.h>
#include <deal.II/lac/full_matrix.h>
#include <deal.II/lac/sparse_matrix.h>
#include <deal.II/lac/dynamic_sparsity_pattern.h>
#include <deal.II/numerics/data_out.h>
#include <deal.II/base/work_stream.h>
#include <deal.II/base/multithread_info.h>
#include <deal.II/base/timer.h>


#include <fstream>
#include <iostream>
#include <thread>
#include <chrono>

#include <deal.II/base/logstream.h>

using namespace dealii;
using namespace std::chrono_literals;


template <int dim>
class Efficiency_Issue
{
public:
  Efficiency_Issue();
  void run();

private:
  void make_grid();
  void setup_system();

  Triangulation<dim> triangulation;
  FE_Q<dim>          fe;
  DoFHandler<dim>    dof_handler;

  SparsityPattern      sparsity_pattern;
  SparseMatrix<double> system_matrix;
  Vector<double>       system_rhs;

  struct PerTaskData
  {
    FullMatrix<double>                   cell_matrix;
    Vector<double>                       cell_rhs;
    std::vector<types::global_dof_index> dof_indices;
  };

  struct ScratchData
  {
    ScratchData(const FiniteElement<dim> &fe);
    ScratchData(const ScratchData &scratch);
    FEValues<dim> fe_values;
  };

  void assemble_on_one_cell(
    const typename DoFHandler<dim>::active_cell_iterator &cell,
    ScratchData &scratch,
    PerTaskData &data);
  void copy_local_to_global(const PerTaskData &data);
  void assemble_system_in_parallel();
};

template <int dim>
Efficiency_Issue<dim>::ScratchData::ScratchData(const FiniteElement<dim> &fe)
  : fe_values(fe,
              QGauss<dim>(fe.degree + 1),
              update_values | update_gradients |
                update_quadrature_points | update_JxW_values)
{}

template <int dim>
Efficiency_Issue<dim>::ScratchData::ScratchData(const ScratchData &scratch)
  : fe_values(scratch.fe_values.get_fe(),
              scratch.fe_values.get_quadrature(),
              scratch.fe_values.get_update_flags())
{}

template <int dim>
Efficiency_Issue<dim>::Efficiency_Issue()
  : fe(1)
  , dof_handler(triangulation)
{}

template <int dim>
void Efficiency_Issue<dim>::make_grid()
{
  GridGenerator::hyper_cube(triangulation, -1, 1);
  triangulation.refine_global(8);

  std::cout << "   Number of active cells: " << triangulation.n_active_cells() << std::endl;
}

template <int dim>
void Efficiency_Issue<dim>::setup_system()
{
  dof_handler.distribute_dofs(fe);
}

template <int dim>
void Efficiency_Issue<dim>::assemble_on_one_cell(
  const typename DoFHandler<dim>::active_cell_iterator &cell,
  ScratchData &scratch,
  PerTaskData &data)
{
  (void)data;

  // This is the call whose wall time I measure:
  scratch.fe_values.reinit(cell);

  // Emulate additional per-cell work; the longer the worker runs,
  // the better the parallel efficiency:
  std::this_thread::sleep_for(1e-6s);
  /*
   * efficiency of 23% with sleep_for(0s)
   * efficiency of 56% with sleep_for(1e-6s)
   * efficiency of 79% with sleep_for(1e-5s)
   * efficiency of 97% with sleep_for(1e-4s)
   */
}

template <int dim>
void Efficiency_Issue<dim>::copy_local_to_global(const PerTaskData &data)
{
  // Nothing to copy in this MWE.
  (void)data;
}

template <int dim>
void Efficiency_Issue<dim>::assemble_system_in_parallel()
{
  const unsigned int N_threads = 1; // compare, e.g., 1 vs. 2
  MultithreadInfo::set_thread_limit(N_threads);

  Timer timer;
  timer.start();
  WorkStream::run(dof_handler.begin_active(),
                  dof_handler.end(),
                  *this,
                  &Efficiency_Issue::assemble_on_one_cell,
                  &Efficiency_Issue::copy_local_to_global,
                  ScratchData(fe),
                  PerTaskData());

  std::cout << "Elapsed wall time with " << N_threads
            << " thread(s): " << timer.wall_time() << " s" << std::endl;
}

template <int dim>
void Efficiency_Issue<dim>::run()
{
  make_grid();
  setup_system();
  assemble_system_in_parallel();
}

int main()
{
  Efficiency_Issue<2> problem;
  problem.run();

  return 0;
}
