Hi,

> So if it is not possible to run 4 processes in parallel efficiently using C

That’s possible in C, and most machines have more than 4 cores, so you could 
probably run more than 4 processes.

The issue is whether you can actually partition your data into chunks, and how 
much communication you need between the separate running parts. From what you 
said it looks like you can get down to one house, or a group of houses, versus 
all the others in a city, but you may have to do some collation at the end. 
Threads have less overhead than processes, and it is easier to 
communicate/share data between them.

If you go down the separate-processes route, then each process would need to 
query the DB for its part of the data, work out the results and then update the 
results. Or with threads you could have a single process that reads all the 
data, processes a small chunk in each thread and then updates small sections. 
If I was coding it this way I would make the number of threads configurable so 
I could see where the sweet spot was re how many threads to use on a given 
machine.

> So it basically boils down to this:  the speed of C plus the ability to run 
> several processes

C is generally (very) fast, and you can run a large number of processes/threads 
until you run out of I/O, CPU or memory. Currently my laptop is running 300+ 
processes and 2000+ threads on an 8-core CPU, but not many of them are doing 
that much and the CPUs are 90% idle.

> in parallel VS the scaleability of Spark to process big datasets.

Spark will scale up to thousands of machines and work with datasets petabytes in size.

Thanks,
Justin
