Re: How to train and predict in parallel via Spark MLlib?
I put a simple example here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/3877825096667927/588180/d9d264e39a.html

On Thu, Feb 18, 2016 at 6:47 AM Игорь Ляхов wrote:
> Xiangrui, thanks for your answer! Could you clarify some details?
> What do you mean by "you can trigger training jobs in different
> threads on the driver"?
Re: How to train and predict in parallel via Spark MLlib?
Xiangrui, thanks for your answer!

Could you clarify some details? What do you mean by "you can trigger
training jobs in different threads on the driver"? I have a 4-machine
cluster (it will grow in the future), and I would like to use all of the
machines in parallel for training and predicting.

Do you have an example? It would be great if you could show me one.

Thanks a lot for your participation!
--Igor

2016-02-18 17:24 GMT+03:00 Xiangrui Meng:
> If you have a big cluster, you can trigger training jobs in different
> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui

--
Best regards,
Игорь Ляхов
Re: How to train and predict in parallel via Spark MLlib?
If you have a big cluster, you can trigger training jobs in different
threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui

On Thu, Feb 18, 2016, 4:28 AM Igor L. wrote:
> Good day, Spark team!
> I have to solve a regression problem for different restrictions.
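[Editor's note: a minimal, untested sketch of the "different threads on the driver" suggestion, using the original poster's loop. Each scala.concurrent.Future submits its own training and prediction jobs, so the driver schedules them concurrently instead of one after another. prepareTrainSet, buildModel, preparePredictionDataSet and predictScores are the poster's helpers from the question below, not Spark APIs.]

    // Sketch only: replace the sequential foreach with concurrent Futures.
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.rdd.RDD

    val futures: List[Future[RDD[(Id, Double)]]] =
      criteria2Rules.map { case (criterion, rules) =>
        Future {
          // Each thread runs the full train-then-predict pipeline for one
          // criterion; Spark's scheduler interleaves the resulting jobs.
          val trainDataSet      = prepareTrainSet(criterion, data)
          val model             = buildModel(trainDataSet)
          val predictionDataSet = preparePredictionDataSet(criterion, data)
          predictScores(predictionDataSet, model, criterion, rules)
        }
      }

    // Collect all per-criterion score RDDs and union them on the driver.
    val result: RDD[(Id, Double)] =
      sc.union(Await.result(Future.sequence(futures), Duration.Inf))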
How to train and predict in parallel via Spark MLlib?
Good day, Spark team!

I have to solve a regression problem for different restrictions. There is a
bunch of criteria, each with a set of rules; I have to build a model and
make predictions for each, then combine all the results and save them.

My current solution looks like this:

    criteria2Rules: List[(String, Set[String])]

    var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
    criteria2Rules.foreach {
      case (criterion, rules) =>
        val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
        val model: GradientBoostedTreesModel = buildModel(trainDataSet)
        val predictionDataSet = preparePredictionDataSet(criterion, data)
        val predictedScores =
          predictScores(predictionDataSet, model, criterion, rules)
        result = result.union(predictedScores)
    }

It works, but it is too slow, because GradientBoostedTreesModel training is
not fast, especially with a large number of features and samples and a
fairly long list of criteria. I suppose it could work better if Spark
trained the models and made the predictions in parallel.

I've tried a relational style of data manipulation:

    val criteria2RulesRdd: RDD[(String, Set[String])]

    val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
    cartesianCriteriaRules2DataRdd
      .aggregateByKey(List[Data]())(
        { case (lst, tuple) => lst :+ tuple },
        { case (lstL, lstR) => lstL ::: lstR }
      )
      .map {
        case ((criterion, rulesSet), scorePredictionDataList) =>
          val trainSet = ???
          val model = ???
          val predictionSet = ???
          val predictedScores = ???
      }
      ...

but it inevitably leads to one RDD being produced inside another RDD
(GradientBoostedTreesModel is trained on an RDD[LabeledPoint]), and as far
as I know that is a bad scenario; e.g. the toy example below doesn't work:

    scala> sc.parallelize(1 to 100).map(x =>
      (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect

Is there any way to use Spark MLlib in a parallel way?

Thank you for your attention!

--
BR,
Junior Scala/Python Developer
Igor L.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org