How to convert a Random Forest model built in R to a similar model in Spark

2016-06-26 Thread Neha Mehta
Hi All,

Request help with the problem mentioned in the mail below. I have an existing
random forest model in R which needs to be deployed on Spark. I am trying to
recreate the model in Spark but am facing the problem described below.

Thanks,
Neha

On Jun 24, 2016 5:10 PM, wrote:
>
> Hi Sun,
>
> I am trying to build a model in Spark. Here are the parameters that were
used in R for creating the model. I am unable to figure out how to specify
similar inputs to the random forest regressor in Spark so that I get a
similar model.
>
> https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
>
> mtry=3
>
> ntree=500
>
> importance=TRUE
>
> maxnodes = NULL
>
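For reference, a minimal sketch of how those randomForest arguments roughly
map onto spark.ml's RandomForestRegressor; the "label"/"features" column
names, the training DataFrame and the maxDepth value are assumptions for
illustration, not a confirmed equivalent of the R model:

  import org.apache.spark.ml.regression.RandomForestRegressor

  val rf = new RandomForestRegressor()
    .setLabelCol("label")                  // assumed label column
    .setFeaturesCol("features")            // assumed assembled feature vector
    .setNumTrees(500)                      // ntree = 500
    .setFeatureSubsetStrategy("onethird")  // closest named strategy to mtry; there is no direct mtry = 3 setting here
    .setMaxDepth(10)                       // no maxnodes equivalent; maxDepth bounds tree size instead

  val model = rf.fit(trainingDF)           // trainingDF: assumed DataFrame with "label" and "features" columns
  println(model.featureImportances)        // importance = TRUE roughly corresponds to featureImportances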
> On May 31, 2016 7:03 AM, "Sun Rui" <sunrise_...@163.com> wrote:
>>
>> I mean training a random forest model (not using R) and using it for
prediction, both with Spark ML.
>>
>>> On May 30, 2016, at 20:15, Neha Mehta <nehamehta...@gmail.com> wrote:
>>>
>>> Thanks Sujeet.. will try it out.
>>>
>>> Hi Sun,
>>>
>>> Can you please tell me what you meant by "Maybe you can try using
the existing random forest model"? Did you mean creating the model again
using Spark MLlib?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>>
>>>>
>>>> From: sujeet jog <sujeet@gmail.com>
>>>> Date: Mon, May 30, 2016 at 4:52 PM
>>>> Subject: Re: Can we use existing R model in Spark
>>>> To: Sun Rui <sunrise_...@163.com>
>>>> Cc: Neha Mehta <nehamehta...@gmail.com>, user <user@spark.apache.org>
>>>>
>>>>
>>>> Try invoking an R script from Spark using the RDD pipe method: get the
work done and receive the model back in an RDD.
>>>>
>>>>
>>>> for ex :-
>>>> .   rdd.pipe("")
>>>>
>>>>
>>>> On Mon, May 30, 2016 at 3:57 PM, Sun Rui <sunrise_...@163.com> wrote:
>>>>>
>>>>> Unfortunately no. Spark does not support loading external models (for
example, PMML) for now.
>>>>> Maybe you can try using the existing random forest model in Spark.
>>>>>
>>>>>> On May 30, 2016, at 18:21, Neha Mehta <nehamehta...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have an existing random forest model created using R. I want to
use that to predict values on Spark. Is it possible to do the same? If yes,
then how?
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Neha
>>>>>
>>>>>
>>>>
>>>>
>>>
>>


How to convert a Random Forest model built in R to a similar model in Spark

2016-06-24 Thread Neha Mehta
Hi Sun,

I am trying to build a model in Spark. Here are the parameters that were
used in R for creating the model. I am unable to figure out how to specify
similar inputs to the random forest regressor in Spark so that I get a
similar model.

https://cran.r-project.org/web/packages/randomForest/randomForest.pdf

mtry=3

ntree=500

importance=TRUE

maxnodes = NULL
On May 31, 2016 7:03 AM, "Sun Rui" <sunrise_...@163.com> wrote:

I mean training a random forest model (not using R) and using it for
prediction, both with Spark ML.

On May 30, 2016, at 20:15, Neha Mehta <nehamehta...@gmail.com> wrote:

Thanks Sujeet.. will try it out.

Hi Sun,

Can you please tell me what you meant by "Maybe you can try using the
existing random forest model"? Did you mean creating the model again using
Spark MLlib?

Thanks,
Neha




> From: sujeet jog <sujeet@gmail.com>
> Date: Mon, May 30, 2016 at 4:52 PM
> Subject: Re: Can we use existing R model in Spark
> To: Sun Rui <sunrise_...@163.com>
> Cc: Neha Mehta <nehamehta...@gmail.com>, user <user@spark.apache.org>
>
>
> Try invoking an R script from Spark using the RDD pipe method: get the
> work done and receive the model back in an RDD.
>
>
> for ex :-
> .   rdd.pipe("")
>
>
> On Mon, May 30, 2016 at 3:57 PM, Sun Rui <sunrise_...@163.com> wrote:
>
>> Unfortunately no. Spark does not support loading external models (for
>> example, PMML) for now.
>> Maybe you can try using the existing random forest model in Spark.
>>
>> On May 30, 2016, at 18:21, Neha Mehta <nehamehta...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have an existing random forest model created using R. I want to use
>> that to predict values on Spark. Is it possible to do the same? If yes,
>> then how?
>>
>> Thanks & Regards,
>> Neha
>>
>>
>>
>
>


Re: Ignore features in Random Forest

2016-06-02 Thread Neha Mehta
Thanks Yuhao.

Regards,
Neha

On Thu, Jun 2, 2016 at 11:51 AM, Yuhao Yang <hhb...@gmail.com> wrote:

> Hi Neha,
>
> This looks like a feature engineering task. I think VectorSlicer can help
> with your case. Please refer to
> http://spark.apache.org/docs/latest/ml-features.html#vectorslicer .
>
> Regards,
> Yuhao
>
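A small sketch of the VectorSlicer idea, assuming the Id columns ended up
inside an assembled "features" vector; the column names and the index list
below are illustrative, not taken from the thread:

  import org.apache.spark.ml.feature.VectorSlicer

  val slicer = new VectorSlicer()
    .setInputCol("features")         // assumed: vector that still contains the Id positions
    .setOutputCol("slicedFeatures")  // vector holding only the positions kept below
    .setIndices(Array(2, 3, 4))      // keep these positions; leave the Id positions out

  val sliced = slicer.transform(assembledDF)   // assembledDF: an assumed input DataFrame
  // then train the random forest on "slicedFeatures" instead of "features"

If the vector is built with VectorAssembler in the first place, simply leaving
the Id columns out of setInputCols achieves the same thing without an extra
stage.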
> 2016-06-01 21:18 GMT+08:00 Neha Mehta <nehamehta...@gmail.com>:
>
>> Hi,
>>
>> I am performing regression using Random Forest. In my input vector, I
>> want the algorithm to ignore certain columns/features while training the
>> model and also while predicting. These are basically Id columns. I
>> checked the documentation and could not find any information on this.
>>
>> Request help with the same.
>>
>> Thanks & Regards,
>> Neha
>>
>
>


Ignore features in Random Forest

2016-06-01 Thread Neha Mehta
Hi,

I am performing regression using Random Forest. In my input vector, I want
the algorithm to ignore certain columns/features while training the
model and also while predicting. These are basically Id columns. I
checked the documentation and could not find any information on this.

Request help with the same.

Thanks & Regards,
Neha


Re: Can we use existing R model in Spark

2016-05-30 Thread Neha Mehta
Thanks Sujeet.. will try it out.

Hi Sun,

Can you please tell me what you meant by "Maybe you can try using the
existing random forest model"? Did you mean creating the model again using
Spark MLlib?

Thanks,
Neha




> From: sujeet jog <sujeet@gmail.com>
> Date: Mon, May 30, 2016 at 4:52 PM
> Subject: Re: Can we use existing R model in Spark
> To: Sun Rui <sunrise_...@163.com>
> Cc: Neha Mehta <nehamehta...@gmail.com>, user <user@spark.apache.org>
>
>
> Try invoking an R script from Spark using the RDD pipe method: get the
> work done and receive the model back in an RDD.
>
>
> for ex :-
> .   rdd.pipe("")
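To make that concrete, a rough sketch of the pipe idea, with assumed file
names and paths: each RDD element is written as one line to the external
process's stdin, and every line the process prints becomes an element of the
resulting RDD[String], so an R script that reads records from stdin, scores
them with the saved randomForest model, and prints one prediction per line
fits this pattern.

  // hypothetical paths; score.R must be available on every worker node
  val records = sc.textFile("hdfs:///data/features.csv")
  val predictions = records.pipe("Rscript score.R")      // RDD[String] of the script's stdout lines
  predictions.saveAsTextFile("hdfs:///data/predictions")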
>
>
> On Mon, May 30, 2016 at 3:57 PM, Sun Rui <sunrise_...@163.com> wrote:
>
>> Unfortunately no. Spark does not support loading external models (for
>> example, PMML) for now.
>> Maybe you can try using the existing random forest model in Spark.
>>
>> On May 30, 2016, at 18:21, Neha Mehta <nehamehta...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have an existing random forest model created using R. I want to use
>> that to predict values on Spark. Is it possible to do the same? If yes,
>> then how?
>>
>> Thanks & Regards,
>> Neha
>>
>>
>>
>
>


Can we use existing R model in Spark

2016-05-30 Thread Neha Mehta
Hi,

I have an existing random forest model created using R. I want to use that
to predict values on Spark. Is it possible to do the same? If yes, then how?

Thanks & Regards,
Neha


Re: How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-20 Thread Neha Mehta
Hi Vishal,

Thanks for the solution. I was able to get it working for my scenario.
Regarding the Task not serializable error, I still get it when I declare the
function outside the main method. However, if I declare it inside main as
"val func = {}", it works fine for me.

In case you have any insight to share on the same, then please do share it.

Thanks for the help.

Regards,
Neha
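For what it is worth, the usual cause of that difference is that a method
defined on an enclosing class makes the closure capture the whole instance
(including non-serializable members such as the SparkContext), while a local
val, or a method on a standalone object, does not. A minimal illustration
with hypothetical names and a placeholder body:

  import org.apache.spark.sql.Row

  // Defining the aggregator on a separate, serializable object keeps the
  // closure passed to map()/mapValues() from dragging in an enclosing class.
  object UserAggregations extends Serializable {
    def myUserAggregator(rows: Iterable[Row]): Map[Int, String] =
      Map(0 -> rows.size.toString)   // placeholder body, for illustration only
  }

  // usage sketch: rdd.groupByKey().mapValues(UserAggregations.myUserAggregator)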

On Wed, Jan 20, 2016 at 11:39 AM, Vishal Maru <vzm...@gmail.com> wrote:

> It seems Spark is not able to serialize your function code to worker nodes.
>
> I have tried to put a solution into a simple set of commands. Maybe you can
> combine the last four lines into a function.
>
>
> val arr = Array((1,"A","<20","0"), (1,"A",">20 & <40","1"), (1,"B",">20 & <40","0"),
>   (1,"C",">20 & <40","0"), (1,"C",">20 & <40","0"), (2,"A","<20","0"),
>   (3,"B",">20 & <40","1"), (3,"B",">40","2"))
>
> val rdd = sc.parallelize(arr)
>
> // total number of records per user
> val prdd = rdd.map(a => (a._1, a))
> val totals = prdd.groupByKey.map(a => (a._1, a._2.size))
>
> // count each (user, C1 value) pair, then divide by the user's total to get the fraction
> val n1 = rdd.map(a => ((a._1, a._2), 1))
> val n2 = n1.reduceByKey(_ + _).map(a => (a._1._1, (a._1._2, a._2)))
> val n3 = n2.join(totals).map(a => (a._1, (a._2._1._1, a._2._1._2.toDouble / a._2._2)))
>
> // format each entry as "value:fraction" and join a user's entries with "|"
> val n4 = n3.map(a => (a._1, a._2._1 + ":" + a._2._2.toString)).reduceByKey((a, b) => a + "|" + b)
>
> n4.collect.foreach(println)
>
>
>
>
> On Mon, Jan 18, 2016 at 6:47 AM, Neha Mehta <nehamehta...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a scenario wherein my dataset has around 30 columns. It is
>> basically user activity information. I need to group the information by
>> each user and then for each column/activity parameter I need to find the
>> percentage affinity for each value in that column for that user. Below is
>> the sample input and output.
>>
>> UserId C1 C2 C3
>> 1 A <20 0
>> 1 A >20 & <40 1
>> 1 B >20 & <40 0
>> 1 C >20 & <40 0
>> 1 C >20 & <40 0
>> 2 A <20 0
>> 3 B >20 & <40 1
>> 3 B >40 2
>>
>> Output
>>
>>
>> 1 A:0.4|B:0.2|C:0.4 <20:0.2|>20 & <40:0.8 0:0.8|1:0.2
>> 2 A:1 <20:1 0:1
>> 3 B:1 >20 & <40:0.5|>40:0.5 1:0.5|2:0.5
>>
>> Presently this is how I am calculating these values:
>> Group by UserId and C1 and compute values for C1 for all the users, then
>> do a group by UserId and C2 and find the fractions for C2 for each user,
>> and so on. This approach is quite slow. Also, the number of records for
>> each user will be at most 30. So I would like to take a second approach
>> wherein I do a groupByKey and pass the entire list of records for each key
>> to a function which computes all the percentages for each column for each
>> user at once. Below are the steps I am trying to follow:
>>
>> 1. Dataframe1 => group by UserId , find the counts of records for each
>> user. Join the results back to the input so that counts are available with
>> each record
>> 2. Dataframe1.map(s=>s(1),s).groupByKey().map(s=>myUserAggregator(s._2))
>>
>> def myUserAggregator(rows: Iterable[Row]):
>> scala.collection.mutable.Map[Int,String] = {
>> val returnValue = scala.collection.mutable.Map[Int,String]()
>> if (rows != null) {
>>   val activityMap = scala.collection.mutable.Map[Int,
>> scala.collection.mutable.Map[String,
>> Int]]().withDefaultValue(scala.collection.mutable.Map[String,
>> Int]().withDefaultValue(0))
>>
>>   val rowIt = rows.iterator
>>   var sentCount = 1
>>   for (row <- rowIt) {
>> sentCount = row(29).toString().toInt
>> for (i <- 0 until row.length) {
>>   var m = activityMap(i)
>>   if (activityMap(i) == null) {
>> m = collection.mutable.Map[String,
>> Int]().withDefaultValue(0)
>>   }
>>   m(row(i).toString()) += 1
>>   activityMap.update(i, m)
>> }
>>   }
>>   var activityPPRow: Row = Row()
>>   for((k,v) <- activityMap) {
>>   var rowVal:String = ""
>>   for((a,b) <- v) {
>> rowVal += rowVal + a + ":" + b

How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-18 Thread Neha Mehta
Hi,

I have a scenario wherein my dataset has around 30 columns. It is basically
user activity information. I need to group the information by each user and
then for each column/activity parameter I need to find the percentage
affinity for each value in that column for that user. Below is the sample
input and output.

UserId C1 C2 C3
1 A <20 0
1 A >20 & <40 1
1 B >20 & <40 0
1 C >20 & <40 0
1 C >20 & <40 0
2 A <20 0
3 B >20 & <40 1
3 B >40 2

Output


1 A:0.4|B:0.2|C:0.4 <20:0.2|>20 & <40:0.8 0:0.8|1:0.2
2 A:1 <20:1 0:1
3 B:1 >20 & <40:0.5|>40:0.5 1:0.5|2:0.5

Presently this is how I am calculating these values:
Group by UserId and C1 and compute values for C1 for all the users, then do
a group by UserId and C2 and find the fractions for C2 for each user, and
so on. This approach is quite slow. Also, the number of records for each
user will be at most 30. So I would like to take a second approach wherein I
do a groupByKey and pass the entire list of records for each key to a
function which computes all the percentages for each column for each user
at once. Below are the steps I am trying to follow:

1. Dataframe1 => group by UserId , find the counts of records for each
user. Join the results back to the input so that counts are available with
each record
2. Dataframe1.map(s=>s(1),s).groupByKey().map(s=>myUserAggregator(s._2))
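Step 1, for instance, can be sketched as follows, where df is a hypothetical
DataFrame with a UserId column:

  val counts = df.groupBy("UserId").count()   // one row per user with a "count" column
  val withCounts = df.join(counts, "UserId")  // every record now carries its user's record count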

def myUserAggregator(rows: Iterable[Row]):
scala.collection.mutable.Map[Int,String] = {
val returnValue = scala.collection.mutable.Map[Int,String]()
if (rows != null) {
  val activityMap = scala.collection.mutable.Map[Int,
scala.collection.mutable.Map[String,
Int]]().withDefaultValue(scala.collection.mutable.Map[String,
Int]().withDefaultValue(0))

  val rowIt = rows.iterator
  var sentCount = 1
  for (row <- rowIt) {
sentCount = row(29).toString().toInt
for (i <- 0 until row.length) {
  var m = activityMap(i)
  if (activityMap(i) == null) {
m = collection.mutable.Map[String,
Int]().withDefaultValue(0)
  }
  m(row(i).toString()) += 1
  activityMap.update(i, m)
}
  }
  var activityPPRow: Row = Row()
  for((k,v) <- activityMap) {
  var rowVal:String = ""
  for((a,b) <- v) {
rowVal += rowVal + a + ":" + b/sentCount + "|"
  }
  returnValue.update(k, rowVal)
//  activityPPRow.apply(k) = rowVal
  }

}
return returnValue
  }

When I run step 2 I get the following error. I am new to Scala and Spark
and am unable to figure out how to pass the Iterable[Row] to a function and
get back the results.

org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.map(RDD.scala:317)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:97)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:108)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:110)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:112)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:114)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
..


Thanks for the help.

Regards,
Neha Mehta
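One way to structure step 2, sketched roughly here, is to key each row by the
user id, groupByKey, and pass a plain function from Iterable[Row] to
Map[Int, String] into mapValues; column 0 is assumed to hold the user id,
values are assumed non-null, and df is a hypothetical DataFrame:

  import org.apache.spark.sql.Row

  // For one user's rows: count each distinct value per column position,
  // then turn the counts into "value:fraction" strings joined by "|".
  def columnFractions(rows: Iterable[Row]): Map[Int, String] = {
    val total = rows.size.toDouble
    val counts = scala.collection.mutable.Map[(Int, String), Int]().withDefaultValue(0)
    for (row <- rows; i <- 1 until row.length)   // position 0 is assumed to be the user id
      counts((i, row(i).toString)) += 1
    counts.toSeq
      .groupBy { case ((col, _), _) => col }
      .map { case (col, vs) =>
        col -> vs.map { case ((_, value), c) => value + ":" + (c / total) }.mkString("|")
      }
  }

  // usage sketch, assuming the first column is an Int user id:
  // val perUser = df.rdd.map(r => (r.getInt(0), r)).groupByKey().mapValues(columnFractions)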

