Re: [Paraview] Parallel Streamtracer
Hello Burlen,

thank you very much for your post. I would really like to test your plugin, so I've started to build it. Unfortunately I get a lot of compiler errors (e.g. vtkstd isn't used in PV master anymore). Which PV version is your plugin based on?

Regards,
Stephan

On 06/04/2012 02:21 AM, Stephan Rogge wrote:

Hello Leo,

OK, I took the disk_out_ref.ex2 example data set and did some time measurements. Remember, my machine has 4 cores + HyperThreading.

My first observation is that PV seems to have a problem distributing the data when the Multi-Core option (GUI) is enabled. When PV is started with builtin Multi-Core, I was not able to apply a stream tracer with more than 1000 seed points (PV freezes and never comes back). By contrast, when the pvserver processes were started manually, I was able to set up to 100,000 seed points. Is this a bug?

Now let's have a look at the scaling performance. As you suggested, I've used the D3 filter to distribute the data across the processes. The stream tracer execution time for 10,000 seed points:

## Builtin: 10.063 seconds
## 1 MPI process (no D3): 10.162 seconds
## 4 MPI processes: 15.615 seconds
## 8 MPI processes: 14.103 seconds

and for 100,000 seed points:

## Builtin: 100.603 seconds
## 1 MPI process (no D3): 100.967 seconds
## 4 MPI processes: 168.1 seconds
## 8 MPI processes: 171.325 seconds

I cannot see any positive scaling behavior here. Maybe this example is not appropriate for scaling measurements?

One more thing: I've visualized the vtkProcessId and saw that the whole vector field is partitioned. I thought that each streamline was integrated in its own process, but it seems that this is not the case. This could explain my scaling issues: for small vector fields the synchronization overhead becomes too large and decreases the overall performance.

My suggestion is to have a parallel StreamTracer which is built for a single machine with several threads. Could it be worthwhile to randomly distribute the seeds over all available (local) processes? Of course, each process would have access to the whole vector field.

Cheers,
Stephan

From: Yuanxin Liu [mailto:leo@kitware.com]
Sent: Friday, 1 June 2012 16:13
To: Stephan Rogge
Cc: Andy Bauer; paraview@paraview.org
Subject: Re: [Paraview] Parallel Streamtracer

Hi Stephan,

I did measure the performance at some point and was able to get a fairly decent speedup with more processors, so I am surprised you are seeing huge latency. Of course, the performance is sensitive to the input. It is also sensitive to how readers distribute data. So one thing you might want to try is to attach the D3 filter to the reader. If that doesn't help, I will be happy to get your data and take a look.

Leo

On Fri, Jun 1, 2012 at 1:54 AM, Stephan Rogge stephan.ro...@tu-cottbus.de wrote:

Leo,

As I mentioned in my initial post of this thread, I used the up-to-date master branch of ParaView, which means I have already used your implementation. I can imagine that parallelizing this algorithm is very tough, and I can see that distributing the calculation over 8 processes does not lead to nice scaling. But I don't understand this huge amount of latency when using the StreamTracer in CAVE mode with two viewports and two pvserver processes on the same machine (extra machine for the client). I guess the tracer filter is applied for each viewport separately? This would be OK as long as both filter executions run in parallel, and I doubt that this is the case. Can you help to clarify my problem?

Regards,
Stephan
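For readers who want to picture the seed distribution suggested in the 4 June message (random assignment of seeds to local processes, each holding the whole vector field), here is a minimal sketch. It assumes a shuffle followed by a round-robin deal; the function name distribute_seeds and the tuple seed format are made up for illustration and are not part of ParaView or VTK. The integration itself is left out.

```python
import random


def distribute_seeds(seeds, num_workers, rng_seed=0):
    """Shuffle the seed points and deal them round-robin to the workers.

    Hypothetical helper for illustration only. Every worker is assumed to
    hold the complete vector field, so any worker can integrate any seed.
    """
    shuffled = list(seeds)
    # A fixed rng_seed keeps the assignment reproducible between runs.
    random.Random(rng_seed).shuffle(shuffled)
    return [shuffled[i::num_workers] for i in range(num_workers)]


# Ten made-up seed points on the x axis, split over four local workers.
seeds = [(float(i), 0.0, 0.0) for i in range(10)]
chunks = distribute_seeds(seeds, 4)
for rank, chunk in enumerate(chunks):
    print(rank, len(chunk))
```

The shuffle is there so that seeds clustered in one region of the field do not all land on the same worker; whether that balances the actual integration cost depends on the data.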
Re: [Paraview] Parallel Streamtracer
Someone told me that you have to clear your build directory completely and start a fresh PV build.

Stephan

-----Original Message-----
From: burlen [mailto:burlen.lor...@gmail.com]
Sent: Friday, 8 June 2012 16:21
To: Stephan Rogge
Cc: 'Yuanxin Liu'; paraview@paraview.org
Subject: Re: [Paraview] Parallel Streamtracer

Hi Stephan,

Oh, thanks for the update, I wasn't aware of these changes. I have been working with 3.14.1.

Burlen
Re: [Paraview] Parallel Streamtracer
OK, you had me a little worried there ;) I will send you some instructions and example data to test with. Our network is down due to an unexpected power outage, so it won't be today.

Burlen
Re: [Paraview] Parallel Streamtracer
Hi Leo,

Thanks, yes, please send your fixes, or you could also push them to GitHub, whichever you prefer.

Burlen

On 06/08/2012 09:10 AM, Yuanxin Liu wrote:

Hi,

I have recently gotten Burlen's code and updated it to work with the latest ParaView. Aside from vtkstd, there are also a few backward-incompatible VTK changes (see the VTK 6.0 section on the VTK wiki), but it is not too much work. I will be happy to send either of you my code changes if you need a reference.

Leo
Re: [Paraview] Parallel Streamtracer
Hi Stephan,

As promised, here are instructions and a small test dataset:
http://www.hpcvis.com/vis/sq-field-tracer.html

Burlen
Re: [Paraview] Parallel Streamtracer
Hi Stephan,

I experienced the scaling behavior that you report when I was working on a project that required generating millions of streamlines for a topological mapping algorithm interactively in ParaView. To get the required scaling I wrote a stream tracer that uses a load-on-demand approach with a tunable block cache, so that all ranks can integrate any streamline and stay busy throughout the entire computation. It was very effective on our data, and I've used it to integrate 30 million streamlines in about 10 minutes on 256 cores. If you really need better scalability than the distributed data tracing approach implemented in PV, you might take a look at our work.

The downside of our approach is that, in order to provide the demand loading, the reader has to implement a VTK object that provides an API giving the integrator direct access to I/O functionality. In case you're interested, the stream tracer class is vtkSQFieldTracer and our reader is vtkSQBOVReader. The latest release can be found here:
https://github.com/burlen/SciberQuestToolKit/tarball/SQTK-20120531

Burlen

From: Yuanxin Liu [mailto:leo@kitware.com]
Sent: Thursday, 31 May 2012 21:33
To: Stephan Rogge
Cc: Andy Bauer; paraview@paraview.org
Subject: Re: [Paraview] Parallel Streamtracer

It is in the current VTK and ParaView master. The class is vtkPStreamTracer.

Leo

On Thu, May 31, 2012 at 3:31 PM, Stephan Rogge stephan.ro...@tu-cottbus.de wrote:

Hi Andy and Leo,

thanks for your replies. Is it possible to get this new implementation? I would like to give it a try.

Regards,
Stephan
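To make the load-on-demand idea from Burlen's message a bit more concrete, here is a minimal sketch of a block cache. It is not the vtkSQFieldTracer or vtkSQBOVReader code: the class name BlockCache, the fake_reader function, and the least-recently-used eviction policy are all assumptions for illustration, since the message only says the cache is tunable.

```python
from collections import OrderedDict


class BlockCache:
    """Keep at most `capacity` data blocks in memory, loading on demand.

    Illustrative only: LRU eviction is an assumed policy, and `load_block`
    stands in for whatever I/O interface a real reader would expose.
    """

    def __init__(self, load_block, capacity):
        self._load_block = load_block
        self._capacity = capacity
        self._blocks = OrderedDict()
        self.misses = 0

    def get(self, block_id):
        if block_id in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(block_id)
            return self._blocks[block_id]
        # Cache miss: read the block, then evict the oldest one if full.
        self.misses += 1
        block = self._load_block(block_id)
        self._blocks[block_id] = block
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)
        return block


def fake_reader(block_id):
    """Hypothetical stand-in for reading one block of the vector field."""
    return {"id": block_id, "velocity": (1.0, 0.0, 0.0)}


cache = BlockCache(fake_reader, capacity=2)
for block_id in [0, 1, 0, 2, 1]:
    cache.get(block_id)
print("blocks read from disk:", cache.misses)
```

The capacity is the tunable part: a larger cache means fewer repeated reads when streamlines revisit a block, at the cost of more memory per rank.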
Re: [Paraview] Parallel Streamtracer
By the way, did you make sure to apply D3? disk_out_ref.ex2 is not partitioned, so by default it would be loaded entirely onto MPI rank 0.

On 31.05.2012 at 17:48, Yuanxin Liu leo@kitware.com wrote:

Hi Stephan,

The previous implementation only has serial performance: it traces the streamlines one at a time and never starts a new streamline until the previous one finishes. With communication overhead, it is not surprising that it got slower. My new implementation lets the processes work on different streamlines simultaneously and should scale much better.

Leo

On Thu, May 31, 2012 at 11:27 AM, Andy Bauer andy.ba...@kitware.com wrote:

Hi Stephan,

The parallel stream tracer uses the partitioning of the grid to determine which process does the integration. When the streamline exits the subdomain of a process, there is a search to see whether it enters a subdomain assigned to any other process before figuring out whether it has left the entire domain. Leo, copied here, has been improving the streamline implementation
Re: [Paraview] Parallel Streamtracer
Hello Berk,

absolutely. After applying both filters, D3 and StreamTracer, I visualized the partitions with vtkProcessId to check whether D3 had been applied, and I could see that the streamlines had different (homogeneous) colors depending on their region. The D3 filter is only applied when more than one MPI process is used. To make things clearer, the stream tracer execution time for 10,000 seed points:

## Builtin (no D3): 10.063 seconds
## 1 MPI process (no D3): 10.162 seconds
## 4 MPI processes (D3): 15.615 seconds
## 8 MPI processes (D3): 14.103 seconds

and for 100,000 seed points:

## Builtin (no D3): 100.603 seconds
## 1 MPI process (no D3): 100.967 seconds
## 4 MPI processes (D3): 168.1 seconds
## 8 MPI processes (D3): 171.325 seconds

Sorry for the confusion.

Regards,
Stephan
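As a rough illustration of what the timings above imply, the following sketch computes the classical speedup (T1 / Tp) and parallel efficiency (speedup / p) relative to the single MPI process run. The numbers are copied from Stephan's message; the function names are made up for this example and are not part of ParaView.

```python
# Wall-clock times in seconds, copied from the message above.
# Keys: number of seed points -> {number of MPI processes: seconds}.
timings = {
    10_000: {1: 10.162, 4: 15.615, 8: 14.103},
    100_000: {1: 100.967, 4: 168.1, 8: 171.325},
}


def speedup(t_serial, t_parallel):
    """Classical speedup S = T1 / Tp; values below 1 mean a slowdown."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency E = S / p; 1.0 would be ideal scaling."""
    return speedup(t_serial, t_parallel) / n_procs


def scaling_table(timings):
    """Return (seeds, procs, speedup, efficiency) rows, baseline = 1 process."""
    rows = []
    for seeds, by_procs in sorted(timings.items()):
        t1 = by_procs[1]
        for procs, t in sorted(by_procs.items()):
            rows.append((seeds, procs, speedup(t1, t), efficiency(t1, t, procs)))
    return rows


for seeds, procs, s, e in scaling_table(timings):
    print(f"{seeds:>7} seeds, {procs} proc(s): speedup {s:.2f}, efficiency {e:.2f}")
```

With these inputs every multi-process run has a speedup below 1 (roughly 0.6 to 0.7), which is the negative scaling described in the message.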
Re: [Paraview] Parallel Streamtracer
Thanks, Leo. That sounds great. I'm looking forward to having a parallel stream tracer for small vector fields.

Stephan
Re: [Paraview] Parallel Streamtracer
Hello Leo,

OK, I took the disk_out_ref.ex2 example data set and did some time measurements. Remember, my machine has 4 cores + HyperThreading.

My first observation is that PV seems to have a problem with distributing the data when the Multi-Core option (GUI) is enabled. When PV is started with builtin Multi-Core, I was not able to apply a stream tracer with more than 1000 seed points (PV freezes and never comes back). Otherwise, when the pvserver processes have been started manually, I was able to set up to 100,000 seed points. Is it a bug?

Now let's have a look at the scaling performance. As you suggested, I've used the D3 filter for distributing the data among the processes. The stream tracer execution time for 10,000 seed points:

## Builtin: 10.063 seconds
## 1 MPI process (no D3): 10.162 seconds
## 4 MPI processes: 15.615 seconds
## 8 MPI processes: 14.103 seconds

and for 100,000 seed points:

## Builtin: 100.603 seconds
## 1 MPI process (no D3): 100.967 seconds
## 4 MPI processes: 168.1 seconds
## 8 MPI processes: 171.325 seconds

I cannot see any positive scaling behavior here. Maybe this example is not appropriate for scaling measurements?

One more thing: I've visualized the vtkProcessId and saw that the whole vector field is partitioned. I thought that each streamline is integrated in its own process, but it seems that this is not the case. This could explain my scaling issues: in the case of small vector fields, the synchronization overhead becomes too large and decreases the overall performance.

My suggestion is to have a parallel StreamTracer which is built for a single machine with several threads. Could it be worthwhile to randomly distribute the seeds over all available (local) processes? Of course, each process would have access to the whole vector field.

Cheers,
Stephan
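The suggestion above (randomly distribute the seeds over local workers, each of which can see the whole vector field) can be sketched in a few lines. This is only a toy illustration and not how ParaView or VTK do it: the analytic field velocity(), the forward-Euler trace() and the helper names are all assumptions made for the example. A thread pool is used so that the sketch runs anywhere; in CPython a real speedup would require a process pool or native threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def velocity(x, y):
    """Hypothetical analytic field (rigid rotation) standing in for real data."""
    return -y, x


def trace(seed, step=0.01, n_steps=100):
    """Forward-Euler integration of one streamline starting at a seed point."""
    x, y = seed
    line = [(x, y)]
    for _ in range(n_steps):
        u, v = velocity(x, y)
        x, y = x + step * u, y + step * v
        line.append((x, y))
    return line


def split_seeds(seeds, n_workers, rng_seed=0):
    """Shuffle the seeds, then deal them round-robin into n_workers chunks."""
    shuffled = list(seeds)
    random.Random(rng_seed).shuffle(shuffled)
    return [shuffled[i::n_workers] for i in range(n_workers)]


def trace_chunk(chunk):
    """Work done by one worker: trace every seed in its chunk independently."""
    return [trace(seed) for seed in chunk]


def trace_all(seeds, n_workers=4):
    """Every worker sees the whole field, so no streamline is ever handed off."""
    chunks = split_seeds(seeds, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(trace_chunk, chunks)
    return [line for chunk_lines in results for line in chunk_lines]


seeds = [(1.0 + 0.01 * i, 0.0) for i in range(1000)]
lines = trace_all(seeds)
print(len(lines), "streamlines,", len(lines[0]), "points each")
```

Shuffling before dealing out the seeds is meant to even out the load when neighbouring seeds produce streamlines of similar length; the price is that every worker needs the full field in memory, which is why this only fits data small enough for a single machine.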
Re: [Paraview] Parallel Streamtracer
Hi, Stephan,

I will look into the multi-core issue as well as the performance issue. Some quick answers:

- Yes, the whole vector field is partitioned and the streamlines are passed from one process to another. This is why the performance can be highly sensitive to how the data are distributed and how the streamlines travel between data partitions.
- Your suggestion makes sense if the data is small enough to be processed on a single machine. This is definitely something we would like to do in the future. Right now, the implementation is targeted more towards handling large data that have to be distributed across multiple machines.

Leo
Re: [Paraview] Parallel Streamtracer
Hi, Stephan,

I did measure the performance at some point and was able to get a fairly decent speedup with more processors, so I am surprised you are seeing huge latency. Of course, the performance is sensitive to the input. It is also sensitive to how readers distribute data. So, one thing you might want to try is to attach the D3 filter to the reader. If that doesn't help, I will be happy to get your data and take a look.

Leo

On Fri, Jun 1, 2012 at 1:54 AM, Stephan Rogge stephan.ro...@tu-cottbus.de wrote:

Leo,

As I mentioned in my initial post of this thread: I used the up-to-date master branch of ParaView, which means I have already used your implementation.

I can imagine that parallelizing this algorithm can be very tough. And I can see that distributing the calculation over 8 processes does not lead to nice scaling. But I don't understand this huge amount of latency when using the StreamTracer in Cave mode with two viewports and two pvserver processes on the same machine (extra machine for the client). I guess the tracer filter is applied for each viewport separately? This would be OK as long as both filter executions run in parallel, and I doubt that this is the case.

Can you help to clarify my problem?

Regards,
Stephan

From: Yuanxin Liu [mailto:leo@kitware.com]
Sent: Thursday, 31 May 2012 21:33
To: Stephan Rogge
Cc: Andy Bauer; paraview@paraview.org
Subject: Re: [Paraview] Parallel Streamtracer

It is in the current VTK and ParaView master. The class is vtkPStreamTracer.

Leo

On Thu, May 31, 2012 at 3:31 PM, Stephan Rogge stephan.ro...@tu-cottbus.de wrote:

Hi, Andy and Leo,

thanks for your replies. Is it possible to get this new implementation? I would like to give it a try.

Regards,
Stephan
[Paraview] Parallel Streamtracer
Hello,

I have a question related to the parallelism of the stream tracer: if I understand the code correctly, each line integration (trace) is processed in its own MPI process. Right?

To test the scalability of the stream tracer, I loaded a structured (curvilinear) grid, applied the filter with a seed resolution of 1500, and checked the timings in a single-threaded and a multi-threaded (Multi-Core enabled in the PV GUI) situation.

I was really surprised that multi-core slows down the execution time to 4 seconds. The single core takes only 1.2 seconds. Data migration cannot be the explanation for that behavior (0.5 seconds). What is the problem here?

Please see some statistics below...

Data:
* Structured (Curvilinear) Grid
* 244030 Cells
* 37 MB Memory

System:
* Intel i7-2600K (4 Cores + HT = 8 Threads)
* 16 GB RAM
* Windows 7 64 Bit
* ParaView (master branch, 64-bit compilation)

# Single Thread (Seed resolution 1500):
# Local Process
Still Render, 0.014 seconds
RenderView::Update, 1.222 seconds
vtkPVView::Update, 1.222 seconds
Execute vtkStreamTracer id: 2184, 1.214 seconds
Still Render, 0.015 seconds

# Eight Threads (Seed resolution 1500):
# Local Process
Still Render, 0.029 seconds
RenderView::Update, 4.134 seconds
vtkSMDataDeliveryManager: Deliver Geome, 0.619 seconds
FullRes Data Migration, 0.619 seconds
Still Render, 0.042 seconds
OpenGL Dev Render, 0.01 seconds

Render Server, Process 0
RenderView::Update, 4.134 seconds
vtkPVView::Update, 4.132 seconds
Execute vtkStreamTracer id: 2193, 3.941 seconds
FullRes Data Migration, 0.567 seconds
Dataserver gathering to 0, 0.318 seconds
Dataserver sending to client, 0.243 seconds

Render Server, Process 1
Execute vtkStreamTracer id: 2193, 3.939 seconds

Render Server, Process 2
Execute vtkStreamTracer id: 2193, 3.938 seconds

Render Server, Process 3
Execute vtkStreamTracer id: 2193, 4.12 seconds

Render Server, Process 4
Execute vtkStreamTracer id: 2193, 3.938 seconds

Render Server, Process 5
Execute vtkStreamTracer id: 2193, 3.939 seconds

Render Server, Process 6
Execute vtkStreamTracer id: 2193, 3.938 seconds

Render Server, Process 7
Execute vtkStreamTracer id: 2193, 3.939 seconds

Cheers,
Stephan

___
Powered by www.kitware.com

Visit other Kitware open-source projects at http://www.kitware.com/opensource/opensource.html

Please keep messages on-topic and check the ParaView Wiki at: http://paraview.org/Wiki/ParaView

Follow this link to subscribe/unsubscribe: http://www.paraview.org/mailman/listinfo/paraview
Re: [Paraview] Parallel Streamtracer
Hi Stephan,

The parallel stream tracer uses the partitioning of the grid to determine which process does the integration. When the streamline exits the subdomain of a process, there is a search to see if it enters a subdomain assigned to any other process before figuring out whether it has left the entire domain.

Leo, copied here, has been improving the streamline implementation inside of VTK, so you may want to get his newer version. It is a pretty tough algorithm to parallelize efficiently without making any assumptions about the flow or partitioning.

Andy
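The ownership rule Andy describes above can be illustrated with a deliberately simplified 1D model. Everything here is an assumption made for the sketch (equal-width slabs on [0, 1), a constant velocity, fixed Euler steps, made-up function names); the real filter works on arbitrary 3D partitions and performs an actual search across processes.

```python
def owner(x, n_ranks, x_min=0.0, x_max=1.0):
    """Rank whose slab contains x, or None if x is outside the whole domain."""
    if not (x_min <= x < x_max):
        return None
    width = (x_max - x_min) / n_ranks
    return min(int((x - x_min) / width), n_ranks - 1)


def trace_with_handoffs(seed, n_ranks, velocity=1.0, step=0.01, max_steps=10_000):
    """Advance one streamline; whoever owns the current position integrates it.

    Returns the list of ranks that worked on the streamline, in order, and
    the number of handoffs between ranks.
    """
    x = seed
    rank = owner(x, n_ranks)
    visited = [] if rank is None else [rank]
    for _ in range(max_steps):
        if rank is None:
            break  # the streamline has left the entire domain
        x += step * velocity
        new_rank = owner(x, n_ranks)
        if new_rank is not None and new_rank != rank:
            visited.append(new_rank)  # streamline entered another subdomain
        rank = new_rank
    handoffs = max(len(visited) - 1, 0)
    return visited, handoffs


for n in (1, 4, 8):
    visited, handoffs = trace_with_handoffs(0.05, n)
    print(f"{n} rank(s): visited {visited}, {handoffs} handoff(s)")
```

In this toy model, a streamline that crosses the whole domain is handed off once per partition boundary, so splitting a small field into more pieces adds communication without adding any integration work.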
Re: [Paraview] Parallel Streamtracer
Hi, Stephan,

The previous implementation only has serial performance: it traces the streamlines one at a time and never starts a new streamline until the previous one finishes. With communication overhead, it is not surprising that it got slower. My new implementation lets the processes work on different streamlines simultaneously and should scale much better.

Leo

On Thu, May 31, 2012 at 11:27 AM, Andy Bauer andy.ba...@kitware.com wrote:

Hi Stephan,

The parallel stream tracer uses the partitioning of the grid to determine which process does the integration. When the streamline exits the subdomain of a process, there is a search to see if it enters a subdomain assigned to any other process before figuring out whether it has left the entire domain. Leo, copied here, has been improving the streamline implementation inside of VTK, so you may want to get his newer version. It is a pretty tough algorithm to parallelize efficiently without making any assumptions about the flow or partitioning.

Andy

On Thu, May 31, 2012 at 4:16 AM, Stephan Rogge stephan.ro...@tu-cottbus.de wrote:

Hello,

I have a question related to the parallelism of the stream tracer: if I understand the code correctly, each line integration (trace) is processed in its own MPI process. Right?

To test the scalability of the stream tracer, I loaded a structured (curvilinear) grid, applied the filter with a seed resolution of 1500, and checked the timings in a single-threaded and a multi-threaded (Multi Core enabled in PV GUI) situation.

I was really surprised that multi-core slows the execution time down to 4 seconds, while the single core takes only 1.2 seconds. Data migration cannot be the explanation for that behavior (0.5 seconds). What is the problem here?

Please see some statistics attached...

Data:
* Structured (Curvilinear) Grid
* 244030 Cells
* 37 MB Memory

System:
* Intel i7-2600K (4 Cores + HT = 8 Threads)
* 16 GB RAM
* Windows 7 64 Bit
* ParaView (master-branch, 64 bit compilation)

# Single Thread (Seed resolution 1500):

Local Process
Still Render, 0.014 seconds
RenderView::Update, 1.222 seconds
vtkPVView::Update, 1.222 seconds
Execute vtkStreamTracer id: 2184, 1.214 seconds
Still Render, 0.015 seconds

# Eight Threads (Seed resolution 1500):

Local Process
Still Render, 0.029 seconds
RenderView::Update, 4.134 seconds
vtkSMDataDeliveryManager: Deliver Geome, 0.619 seconds
FullRes Data Migration, 0.619 seconds
Still Render, 0.042 seconds
OpenGL Dev Render, 0.01 seconds

Render Server, Process 0
RenderView::Update, 4.134 seconds
vtkPVView::Update, 4.132 seconds
Execute vtkStreamTracer id: 2193, 3.941 seconds
FullRes Data Migration, 0.567 seconds
Dataserver gathering to 0, 0.318 seconds
Dataserver sending to client, 0.243 seconds

Render Server, Process 1
Execute vtkStreamTracer id: 2193, 3.939 seconds

Render Server, Process 2
Execute vtkStreamTracer id: 2193, 3.938 seconds

Render Server, Process 3
Execute vtkStreamTracer id: 2193, 4.12 seconds

Render Server, Process 4
Execute vtkStreamTracer id: 2193, 3.938 seconds

Render Server, Process 5
Execute vtkStreamTracer id: 2193, 3.939 seconds

Render Server, Process 6
Execute vtkStreamTracer id: 2193, 3.938 seconds

Render Server, Process 7
Execute vtkStreamTracer id: 2193, 3.939 seconds

Cheers,
Stephan
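
The two descriptions in this message (integration owned by whichever process holds the subdomain, and one-at-a-time versus simultaneous tracing) can be illustrated with a toy model. The sketch below is not VTK's implementation. It assumes a 1-D domain [0, 1) cut into equal subdomains, a constant flow to the right, and a rule that each rank advances at most one streamline by one step per round. The names `owner` and `trace_rounds` are hypothetical, and communication cost is not modeled at all.

```python
from collections import deque


def owner(x, n_ranks):
    """Rank whose subdomain contains x, or None if x is outside [0, 1)."""
    if not 0.0 <= x < 1.0:
        return None
    return min(int(x * n_ranks), n_ranks - 1)


def trace_rounds(seeds, n_ranks, step=0.01, concurrent=True):
    """Count rounds and hand-offs until every streamline has left the domain.

    Per round, each rank advances at most one streamline by one step.
    concurrent=False keeps a single streamline in flight globally, which
    mimics the one-at-a-time behavior described for the old tracer.
    Seeds are assumed to lie inside [0, 1).
    """
    queues = [deque() for _ in range(n_ranks)]
    pending = deque(seeds)
    if concurrent:
        # Hand every seed to the rank that owns its position up front.
        while pending:
            x = pending.popleft()
            queues[owner(x, n_ranks)].append(x)

    rounds = 0
    handoffs = 0
    while pending or any(queues):
        if not concurrent and not any(queues):
            # Start the next streamline only after the previous one finished.
            x = pending.popleft()
            queues[owner(x, n_ranks)].append(x)

        moved = []
        for rank, queue in enumerate(queues):
            if not queue:
                continue  # this rank is idle in this round
            x = queue.popleft() + step
            new_owner = owner(x, n_ranks)
            if new_owner is None:
                continue  # streamline left the entire domain
            if new_owner != rank:
                handoffs += 1  # streamline crossed into another subdomain
            moved.append((new_owner, x))
        for rank, x in moved:
            queues[rank].append(x)
        rounds += 1
    return rounds, handoffs


seeds = [i / 40 for i in range(40)]
serial_rounds, serial_handoffs = trace_rounds(seeds, 4, concurrent=False)
conc_rounds, conc_handoffs = trace_rounds(seeds, 4, concurrent=True)
print("one at a time:", serial_rounds, "rounds,", serial_handoffs, "hand-offs")
print("simultaneous: ", conc_rounds, "rounds,", conc_handoffs, "hand-offs")
```

Even in this idealized setting the simultaneous schedule is limited by the busiest rank: with a flow that carries every streamline toward the last subdomain, that rank ends up doing most of the work. This is one way to read the remark that the algorithm is hard to parallelize without assumptions about the flow or partitioning.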
Re: [Paraview] Parallel Streamtracer
Hi, Andy and Leo,

thanks for your replies. Is it possible to get this new implementation? I would like to give it a try.

Regards,
Stephan

On 31.05.2012 at 17:48, Yuanxin Liu leo@kitware.com wrote:

Hi, Stephan,

The previous implementation only has serial performance: it traces the streamlines one at a time and never starts a new streamline until the previous one finishes. With communication overhead, it is not surprising that it got slower. My new implementation lets the processes work on different streamlines simultaneously and should scale much better.

Leo
Re: [Paraview] Parallel Streamtracer
It is in the current VTK and ParaView master. The class is vtkPStreamTracer.

Leo

On Thu, May 31, 2012 at 3:31 PM, Stephan Rogge stephan.ro...@tu-cottbus.de wrote:

Hi, Andy and Leo,

thanks for your replies. Is it possible to get this new implementation? I would like to give it a try.

Regards,
Stephan