Hi,

> So if it is not possible to run 4 processes in parallel efficiently using C
That’s possible in C, and most machines have more than 4 cores, so you could probably run more than 4 processes. The real questions are: can you actually partition your data into chunks, and how much communication do the separate running parts need? From what you said, it looks like you can get down to one house, or a group of houses, versus all the others in a city, but you may have to do some collation at the end.

Threads have less overhead than processes, and it is easier to communicate/share data between them. If you go down the separate-processes route, each process would need to query the DB for its part of the data, work out the results, and then update the results. With threads, you could instead have a single process that reads all the data, then each thread processes a small chunk and updates a small section of the results. If I were coding it this way, I would make the number of threads configurable so I could find the sweet spot for how many threads to use on a given machine.

> So it basically boils down to this: the speed of C plus the ability to run
> several processes

C is generally (very) fast, and you can run a large number of processes/threads until you run out of I/O, CPU, or memory. My laptop is currently running 300+ processes and 2000+ threads on an 8-core CPU, but not many of them are doing much and the CPUs are 90% idle.

> in parallel VS the scaleability of Spark to process big datasets.

Spark will scale up to 1000s of machines and work with data petabytes in size.

Thanks,
Justin
