RE: Query related to spark cluster
In reply to: Deepak Sharma, Monday, May 30, 2016 12:17 PM

Hi All,

@Deepak: Thanks for your suggestion. We are using Mesos to manage the Spark cluster.

@Jörn: We chose Postgres-XL for its geospatial support, since we store location data.

We were looking at how to improve things quickly and what the right approach is. Our original thinking was to use a different cluster for each need, i.e. three clusters instead of one:

1) Spark cluster, including HDFS. We need HDFS because we have to read data from an SFTP location, and we thought it best to write that data to HDFS first.
2) Distributed R cluster. R does not scale, we need scaling, and we have no time to move to SparkR, so we thought we would try Distributed R.
3) Postgres-XL cluster. This is the DB cluster: the Spark cluster would write to it, and R would read from and write to it.

In the current setup we have included all components in the same cluster.

Can you please help me choose the best approach, one that does not compromise scalability or failover?

Regards,
Saurabh
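The write path described in this message, where the Spark cluster writes its results to the Postgres-XL cluster, could be sketched roughly as follows. This is only a sketch under assumptions: it assumes a Postgres-XL coordinator is reachable through the standard PostgreSQL JDBC driver, and the host name `pgxl-coord1`, port 5432, database `analytics`, table name, and credentials are all hypothetical. `df` stands for any Spark DataFrame produced by the ETL job.

```python
def pgxl_jdbc_url(host, port=5432, database="analytics"):
    """Build a PostgreSQL-style JDBC URL for a Postgres-XL coordinator.

    Host, port and database name are placeholders, not values from the thread.
    """
    return "jdbc:postgresql://{}:{}/{}".format(host, port, database)


def pgxl_properties(user, password):
    """Connection properties handed to Spark's JDBC writer."""
    return {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver class
    }


def write_to_pgxl(df, table, host, user, password):
    """Append a Spark DataFrame to a table through DataFrameWriter.jdbc."""
    df.write.jdbc(
        url=pgxl_jdbc_url(host),
        table=table,
        mode="append",
        properties=pgxl_properties(user, password),
    )
```

Whether Spark and Postgres-XL share nodes or run as separate clusters, the write goes over JDBC in both cases, so this part of the job would not need to change if the clusters were split later.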
Re: Query related to spark cluster
In reply to: Jörn Franke, Mon, May 30, 2016 at 12:12 PM

Hi Saurabh,

You can have a Hadoop cluster running YARN as the scheduler. Configure Spark to run with the same YARN setup. Then you need R on only one node, and you connect to the cluster using SparkR.

Thanks
Deepak
www.bigdatabig.com
www.keosha.net
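As a rough illustration of running Spark on an existing YARN setup, the helper below assembles a `spark-submit` command line that targets YARN in client mode. It is a sketch, not a tuned configuration: the executor count, memory, and core values are placeholders, the application file name `etl_job.py` is hypothetical, and it assumes `HADOOP_CONF_DIR` (or `YARN_CONF_DIR`) on the submitting machine points at the cluster's Hadoop configuration.

```python
import shlex


def spark_submit_yarn(app, num_executors=6, executor_memory="4g", executor_cores=2):
    """Return a spark-submit argument list that targets YARN in client mode.

    The resource values are illustrative defaults, not recommendations.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in spark_submit_yarn("etl_job.py")))
```

With this arrangement YARN schedules the executors across the cluster, so only the machine that submits the job needs the client-side tooling.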
RE: Query related to spark cluster
In reply to: Jörn Franke, Monday, May 30, 2016 12:12 PM

Hi Jörn,

Thanks for the suggestion. My current cluster setup is shown in the attached snapshot. Apart from Postgres-XL, do you see any problem there?

Regards,
Saurabh

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Query related to spark cluster
In reply to: Kumar, Saurabh 5. (Nokia - IN/Bangalore), 30 May 2016, 08:38

Well, if you require R, then you need to install it (including all additional packages) on each node. I am not sure why you store the data in Postgres. Storing it as Parquet or ORC in HDFS (sorted on the relevant columns) is sufficient, and you can use the SparkR libraries to access it.
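The suggestion to store the data in HDFS as Parquet or ORC, sorted on the relevant columns, might look like the sketch below in PySpark. The function name, the output path, and the choice of sort columns are hypothetical; `df` is any Spark DataFrame. Sorting before writing tends to cluster similar values together in the files, which can help readers skip data when filtering on those columns.

```python
def write_sorted(df, path, sort_columns, fmt="parquet"):
    """Sort a Spark DataFrame on the given columns and write it as Parquet or ORC.

    Existing output at `path` is overwritten.
    """
    if fmt not in ("parquet", "orc"):
        raise ValueError("fmt must be 'parquet' or 'orc', got %r" % (fmt,))
    sorted_df = df.sort(*sort_columns)
    sorted_df.write.mode("overwrite").format(fmt).save(path)
```

A call such as `write_sorted(df, "hdfs:///data/locations", ["region", "event_time"])` would then produce files that SparkR or any other Spark API can read back directly from HDFS; the path and column names here are placeholders.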
Query related to spark cluster
Hi Team,

I am using Apache Spark to build a scalable analytics engine. My setup is as follows.

Flow of processing:

Raw files > Store to HDFS > Process with Spark and store to Postgres-XL database > R reads data from Postgres-XL and processes it in distributed mode.

I have a 6-node cluster set up for ETL operations, which has:

1. Spark slaves installed on all 6 nodes.
2. HDFS data nodes on each of the 6 nodes, with replication factor 2.
3. A Postgres-XL 9.5 database coordinator on each of the 6 nodes.
4. R installed on all nodes, processing data from Postgres-XL in a distributed manner.

Can you please guide me on the pros and cons of this setup?
Is installing all components on every machine recommended, or is there any drawback?
Should R run on the Spark cluster?

Thanks & Regards
Saurabh Kumar
R&D Engineer, T&I TED Technology Explorat&Disruption
Nokia Networks
L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
Mobile: +91-8861012418
http://networks.nokia.com/
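One consequence of the HDFS layout described above (6 data nodes, replication factor 2) can be worked out with simple arithmetic, sketched below. The per-node disk size of 4 TB is a made-up figure for illustration, and the estimate ignores space reserved for non-HDFS use. It also assumes HDFS places the replicas of a block on distinct data nodes, which is its normal behaviour.

```python
def hdfs_usable_capacity_tb(nodes, disk_per_node_tb, replication):
    """Approximate usable HDFS capacity: raw capacity divided by the replication factor."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return nodes * disk_per_node_tb / replication


def tolerated_datanode_failures(replication):
    """Data nodes that can fail at once with every block still keeping one replica."""
    return replication - 1


if __name__ == "__main__":
    # 6 nodes with a hypothetical 4 TB each, replication factor 2.
    print(hdfs_usable_capacity_tb(6, 4, 2))
    print(tolerated_datanode_failures(2))
```

With replication factor 2, half of the raw disk is usable and the loss of any single data node is survivable, but two simultaneous node failures can lose blocks whose replicas both lived on those nodes. Co-locating the database coordinators and R on the same machines means a node failure affects all of those services at once.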