RE: Running R codes in sparkR

2016-05-31 Thread Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Hi Arunkumar,

SparkR has much more limited functionality than R, and some R data types such as 'data.table' are not available in SparkR. So you need to check the compatibility of your R code with SparkR carefully.
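For example, a data.table (or plain data.frame) has to be converted to a Spark DataFrame before SparkR can operate on it. A minimal sketch, assuming the Spark 1.6-era SparkR API (in Spark 2.x, sparkR.session() replaces the init calls):

library(SparkR)
sc <- sparkR.init(appName = "compat-check")
sqlContext <- sparkRSQL.init(sc)

# A local R data.frame (or a data.table coerced to one) must be converted
# into a distributed SparkR DataFrame first.
local_df <- data.frame(id = 1:3, value = c(10, 20, 30))
sdf <- createDataFrame(sqlContext, local_df)

# Only SparkR's own API (select, filter, agg, ...) works on the distributed
# DataFrame; data.table syntax such as dt[, sum(value)] does not.
head(agg(sdf, total = sum(sdf$value)))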

Regards,
Saurabh

-Original Message-
From: mylistt...@gmail.com [mailto:mylistt...@gmail.com] 
Sent: Tuesday, May 31, 2016 6:35 PM
To: Arunkumar Pillai 
Cc: user 
Subject: Re: Running R codes in sparkR

Hi Arunkumar,

Yes, R can be integrated with Spark to give you SparkR. There are a couple of blogs on the net, and the official Spark documentation covers it too.

https://spark.apache.org/docs/latest/sparkr.html
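As a quick illustration, running R code against Spark looks roughly like the sketch below (assuming the Spark 1.6-era API described on that page; in 2.x, sparkR.session() replaces the init calls):

# Launched via ./bin/sparkR, or from a plain R session with SPARK_HOME set.
library(SparkR)

sc <- sparkR.init(master = "local[*]", appName = "sparkr-intro")
sqlContext <- sparkRSQL.init(sc)

# Ordinary R data becomes a distributed DataFrame...
df <- createDataFrame(sqlContext, faithful)

# ...which is then manipulated with SparkR verbs, not base-R functions.
head(filter(df, df$waiting < 50))

sparkR.stop()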



Just remember that not all of the R packages you may have worked with in R are supported in SparkR, though a good set of R packages does work with it.

As I understand it, you cannot run functions like sapply, for example; the constraint is that such packages need to be ported/coded for RDDs. From what I have gathered (mostly from watching YouTube videos), the R community is not very deeply involved with the Spark community.
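A rough sketch of the difference (assuming the standard SparkR 1.6-era API; newer releases from 2.0 onwards also add spark.lapply/dapply for shipping user functions to the cluster):

library(SparkR)
sc <- sparkR.init(appName = "no-sapply")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, data.frame(x = 1:10))

# Base R style -- works only on local data, not on a distributed DataFrame:
# sapply(1:10, function(v) v * 2)

# SparkR style -- express the same transformation as column operations,
# which Spark can execute on the cluster:
df2 <- withColumn(df, "doubled", df$x * 2)
head(df2)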





On May 31, 2016, at 18:16, Arunkumar Pillai  wrote:

> Hi
> 
> I have a basic doubt regarding SparkR.
> 
> 1. Can we run R code in Spark using SparkR, or are there some Spark
> functionalities that can be executed in Spark through R?
> 
> 
> 
> -- 
> Thanks and Regards
> Arun


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




RE: Query related to spark cluster

2016-05-30 Thread Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Hi All,

@Deepak: Thanks for your suggestion; we are using Mesos to manage the Spark cluster.

@Jörn: the reason we chose Postgres-XL was its geospatial support, as we store location data.

We were looking at how to improve things quickly and what the right approach is.

Our original thinking was to use different clusters for different needs.

E.g., instead of 1 cluster we were thinking of having 3 clusters:

1) Spark cluster -- including HDFS. We need HDFS because we have to read data from an SFTP location, and we thought it best to write it to HDFS first.

2) Distributed R cluster -- since R does not scale on its own, we have a need for scaling, and there is no time to move to SparkR, we thought we would try distributed R.

3) Postgres-XL cluster -- this is the DB cluster, so the Spark cluster would write to the Postgres-XL cluster and R would read/write to Postgres-XL.

In the current setup we have included all components in the same cluster. Can you please help me choose the best approach that will not compromise scalability and the failover mechanism? For reference, the Spark-to-Postgres-XL write step we have in mind is sketched below.
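A rough sketch of that step from SparkR (connection details, paths and table names are illustrative; write.jdbc needs SparkR 2.0+ and the PostgreSQL JDBC driver on the classpath -- on 1.x we would instead write Parquet to HDFS and bulk-load it into the database):

library(SparkR)
sparkR.session(appName = "etl-to-postgres")  # SparkR 2.0+ entry point

# Read the raw files that were staged on HDFS after the SFTP pull.
raw <- read.df("hdfs:///data/raw/events", source = "csv", header = "true")

# ... transformations with SparkR verbs ...

# Hypothetical Postgres-XL coordinator endpoint and target table.
write.jdbc(raw,
           url = "jdbc:postgresql://pgxl-coordinator:5432/analytics",
           tableName = "events_processed",
           mode = "append",
           user = "etl", password = "secret")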


Regards,
Saurabh



From: Deepak Sharma [mailto:deepakmc...@gmail.com]
Sent: Monday, May 30, 2016 12:17 PM
To: Jörn Franke <jornfra...@gmail.com>
Cc: Kumar, Saurabh 5. (Nokia - IN/Bangalore) <saurabh.5.ku...@nokia.com>; 
user@spark.apache.org; Sawhney, Prerna (Nokia - IN/Bangalore) 
<prerna.sawh...@nokia.com>
Subject: Re: Query related to spark cluster

Hi Saurabh,
You can have a Hadoop cluster running YARN as the scheduler.
Configure Spark to run on the same YARN setup.
Then you need R on only one node, and you connect to the cluster using SparkR, roughly as in the sketch below.
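A sketch, run from the single edge node that has R plus the Spark/Hadoop client configuration (assumes the Spark 1.6-era SparkR API; paths, column names and executor settings are illustrative):

library(SparkR)

# yarn-client mode: the driver runs on this node, executors are scheduled
# by YARN across the cluster.
sc <- sparkR.init(master = "yarn-client",
                  appName = "sparkr-on-yarn",
                  sparkEnvir = list(spark.executor.instances = "6",
                                    spark.executor.memory = "4g"))
sqlContext <- sparkRSQL.init(sc)

# Plain DataFrame operations do not require R on the worker nodes.
df <- read.df(sqlContext, "hdfs:///data/raw/events", source = "parquet")
head(count(groupBy(df, df$type)))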

Thanks
Deepak

On Mon, May 30, 2016 at 12:12 PM, Jörn Franke <jornfra...@gmail.com> wrote:

Well, if you require R then you need to install it (including all additional packages) on each node. I am not sure why you store the data in Postgres. Storing it as Parquet or ORC in HDFS (sorted on the relevant columns) is sufficient, and you can use the SparkR libraries to access it.
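A sketch of that approach (illustrative paths and column names, Spark 1.6-era SparkR API):

library(SparkR)
sc <- sparkR.init(appName = "parquet-instead-of-db")
sqlContext <- sparkRSQL.init(sc)

events <- read.df(sqlContext, "hdfs:///data/staging/events.json", source = "json")

# Sort on the columns you will typically filter or join on before persisting,
# then keep the data as Parquet in HDFS instead of loading it into a database.
sorted <- arrange(events, events$region, events$event_time)
write.df(sorted, path = "hdfs:///data/curated/events",
         source = "parquet", mode = "overwrite")

# Any SparkR job (or R user via SparkR) reads it straight back from HDFS.
curated <- read.df(sqlContext, "hdfs:///data/curated/events", source = "parquet")
head(filter(curated, curated$region == "IN"))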

On 30 May 2016, at 08:38, Kumar, Saurabh 5. (Nokia - IN/Bangalore) <saurabh.5.ku...@nokia.com> wrote:
Hi Team,

I am using Apache Spark to build a scalable analytics engine. My setup is as follows.

The flow of processing is:

Raw files > store to HDFS > process with Spark and store to the Postgres-XL database > R processes data from Postgres-XL in distributed mode.

I have a 6-node cluster set up for ETL operations, which has:

1.  Spark slaves installed on all 6 nodes.
2.  HDFS data nodes on each of the 6 nodes, with replication factor 2.
3.  A Postgres-XL 9.5 database coordinator on each of the 6 nodes.
4.  R installed on all nodes, used to process data from Postgres-XL in a distributed manner.




Can you please guide me on the pros and cons of this setup?
Is installing all components on every machine recommended, or are there drawbacks?
Should the R software run on the Spark cluster?



Thanks & Regards
Saurabh Kumar
R Engineer, T TED Technology Explorat
Nokia Networks
L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
Mobile: +91-8861012418
http://networks.nokia.com/






--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Query related to spark cluster

2016-05-30 Thread Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Hi Jörn,

Thanks for the suggestion.

My current cluster setup is shown in the attached snapshot. Apart from Postgres-XL, do you see any problem with it?


Regards,
Saurabh

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Monday, May 30, 2016 12:12 PM
To: Kumar, Saurabh 5. (Nokia - IN/Bangalore) <saurabh.5.ku...@nokia.com>
Cc: user@spark.apache.org; Sawhney, Prerna (Nokia - IN/Bangalore) 
<prerna.sawh...@nokia.com>
Subject: Re: Query related to spark cluster


Well, if you require R then you need to install it (including all additional packages) on each node. I am not sure why you store the data in Postgres. Storing it as Parquet or ORC in HDFS (sorted on the relevant columns) is sufficient, and you can use the SparkR libraries to access it.

On 30 May 2016, at 08:38, Kumar, Saurabh 5. (Nokia - IN/Bangalore) <saurabh.5.ku...@nokia.com> wrote:
Hi Team,

I am using Apache Spark to build a scalable analytics engine. My setup is as follows.

The flow of processing is:

Raw files > store to HDFS > process with Spark and store to the Postgres-XL database > R processes data from Postgres-XL in distributed mode.

I have a 6-node cluster set up for ETL operations, which has:

1.  Spark slaves installed on all 6 nodes.
2.  HDFS data nodes on each of the 6 nodes, with replication factor 2.
3.  A Postgres-XL 9.5 database coordinator on each of the 6 nodes.
4.  R installed on all nodes, used to process data from Postgres-XL in a distributed manner.




Can you please guide me on the pros and cons of this setup?
Is installing all components on every machine recommended, or are there drawbacks?
Should the R software run on the Spark cluster?



Thanks & Regards
Saurabh Kumar
R Engineer, T TED Technology Explorat
Nokia Networks
L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
Mobile: +91-8861012418
http://networks.nokia.com/




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Query related to spark cluster

2016-05-30 Thread Kumar, Saurabh 5. (Nokia - IN/Bangalore)
Hi Team,

I am using Apache Spark to build a scalable analytics engine. My setup is as follows.

The flow of processing is:

Raw files > store to HDFS > process with Spark and store to the Postgres-XL database > R processes data from Postgres-XL in distributed mode.

I have a 6-node cluster set up for ETL operations, which has:

1.  Spark slaves installed on all 6 nodes.
2.  HDFS data nodes on each of the 6 nodes, with replication factor 2.
3.  A Postgres-XL 9.5 database coordinator on each of the 6 nodes.
4.  R installed on all nodes, used to process data from Postgres-XL in a distributed manner.




Can you please guide me on the pros and cons of this setup?
Is installing all components on every machine recommended, or are there drawbacks?
Should the R software run on the Spark cluster?



Thanks & Regards
Saurabh Kumar
R Engineer, T TED Technology Explorat
Nokia Networks
L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
Mobile: +91-8861012418
http://networks.nokia.com/