Hi Team,

I am using Apache Spark to build a scalable analytics engine. My setup is
described below.

The processing flow is:

Raw files > store to HDFS > process with Spark and store to Postgres-XL database >
R processes data from Postgres-XL in distributed mode.
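
For context, the Spark stage looks roughly like the sketch below (Scala). The
HDFS path, JDBC URL, table name, and credentials are placeholders, and the real
transformations are elided; Postgres-XL is reached through a coordinator using
the standard PostgreSQL JDBC driver, since it speaks the Postgres wire protocol.

import java.util.Properties

import org.apache.spark.sql.SparkSession

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("analytics-etl").getOrCreate()

    // Read raw files from HDFS (CSV assumed here purely for illustration).
    val raw = spark.read.option("header", "true").csv("hdfs:///data/raw/")

    // Placeholder for the actual processing logic.
    val processed = raw

    // Write results to Postgres-XL via a coordinator node; connection
    // details below are placeholders.
    val props = new Properties()
    props.setProperty("user", "etl_user")
    props.setProperty("password", "secret")
    props.setProperty("driver", "org.postgresql.Driver")

    processed.write
      .mode("append")
      .jdbc("jdbc:postgresql://coordinator-host:5432/analytics", "processed_events", props)

    spark.stop()
  }
}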

I have a 6-node cluster set up for ETL operations, which has:

1.      Spark slaves installed on all 6 nodes.
2.      HDFS data nodes on each of the 6 nodes, with replication factor 2.
3.      Postgres-XL 9.5 database coordinators on each of the 6 nodes.
4.      R installed on all nodes; it processes data from Postgres-XL in a
distributed manner.




      Can you please guide me on the pros and cons of this setup?
      Is installing all components on every machine recommended, or are there
drawbacks?
      Should R run on the Spark cluster?



Thanks & Regards
Saurabh Kumar
R&D Engineer, T&I TED Technology Exploration & Disruption
Nokia Networks
L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
Mobile: +91-8861012418
http://networks.nokia.com/


