[ 
https://issues.apache.org/jira/browse/SPARK-21515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21515.
-------------------------------
    Resolution: Invalid

This is a question for StackOverflow or the mailing list.
https://spark.apache.org/contributing.html

> Spark ML Random Forest
> ----------------------
>
>                 Key: SPARK-21515
>                 URL: https://issues.apache.org/jira/browse/SPARK-21515
>             Project: Spark
>          Issue Type: Question
>          Components: Build
>    Affects Versions: 2.1.1
>            Reporter: KovvuriSriRamaReddy
>
> We read data from a flat file and store it in a Dataset<Row>.
> We have a for loop in which we need to modify the Dataset and use the result
> in the next iteration. [ The first iteration uses the original Dataset. ]
> A variable of type Dataset<Row> is immutable. Inside the for loop, however,
> we perform some processing (Spark ML Random Forest) on this variable
> (Dataset<Row>) and use the updated result in the next iteration. This
> continues until all iterations are complete. [ The size of the Dataset stays
> the same; only the values change. ]
> Approach 1: We store the intermediate result in a new Dataset variable and
> use it in the next iteration. We observed that the first iteration of the
> loop took only 1 sec, while each later iteration took progressively longer
> [ i.e. the 2nd iteration took 70 sec, the 3rd took 90 sec, and so on ].
> Approach 2: We write the intermediate Dataset to HDFS or an external file
> and read it back fresh in each iteration. Each iteration then completes
> faster than in Approach 1. However, writing the data to and reading it from
> HDFS or the external file takes additional time.
> This is the problem we need to tune. Could anyone please suggest a better
> solution?
> Note: We unpersist the previous Datasets and set them to null at the end of
> the loop.
> Thanks in advance.
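
The timings reported for Approach 1 are consistent with a lazily evaluated plan that grows by one step per iteration, although the report does not confirm that this is the cause. Spark builds each derived Dataset on top of the plan of its parent, so a loop that keeps reassigning the result can end up with a longer plan every time. The sketch below is a toy model in plain Java, not Spark code: `LineageSketch`, `transform`, and `run` are hypothetical names, and a `Supplier<double[]>` stands in for a Dataset. It only counts how many transformation steps execute when the plan is kept (as in Approach 1) versus replaced by its computed values (as in Approach 2).

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Toy model of a lazily evaluated dataset; illustrative only, not Spark API. */
final class LineageSketch {

    /** Counts how many transformation steps were actually executed. */
    static final AtomicLong stepsRun = new AtomicLong();

    /** Wraps a plan in one more lazy step; nothing runs until get() is called. */
    static Supplier<double[]> transform(Supplier<double[]> parent) {
        return () -> {
            double[] in = parent.get();        // re-evaluates the whole parent plan
            stepsRun.incrementAndGet();
            double[] out = new double[in.length];
            for (int i = 0; i < in.length; i++) {
                out[i] = in[i] + 1.0;          // same size, only the values change
            }
            return out;
        };
    }

    /** Runs the loop and returns the total number of steps executed. */
    static long run(int iterations, boolean materializeEachIteration) {
        stepsRun.set(0);
        Supplier<double[]> current = () -> new double[] {0.0, 0.0, 0.0};
        for (int i = 0; i < iterations; i++) {
            Supplier<double[]> next = transform(current);
            double[] result = next.get();      // the "action" of this iteration
            if (materializeEachIteration) {
                current = () -> result;        // plan replaced by computed values
            } else {
                current = next;                // plan keeps growing
            }
        }
        return stepsRun.get();
    }

    public static void main(String[] args) {
        System.out.println("plan kept:         " + run(10, false)); // 55
        System.out.println("plan materialized: " + run(10, true));  // 10
    }
}
```

In this model, keeping the plan costs 1 + 2 + ... + n steps over n iterations, while materializing costs n. In Spark itself, `Dataset.checkpoint()` (available since 2.1, and requiring a checkpoint directory to be set) is one mechanism for truncating a plan without a manual write and read; whether it would help in the reporter's case is an assumption, not something the report establishes.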



