[ https://issues.apache.org/jira/browse/SPARK-21515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-21515.
-------------------------------
    Resolution: Invalid

This is a question for StackOverflow or the mailing list. https://spark.apache.org/contributing.html

> Spark ML Random Forest
> ----------------------
>
>                 Key: SPARK-21515
>                 URL: https://issues.apache.org/jira/browse/SPARK-21515
>             Project: Spark
>          Issue Type: Question
>          Components: Build
>    Affects Versions: 2.1.1
>            Reporter: KovvuriSriRamaReddy
>
> We are reading data from a flat file and storing it in a Dataset<Row>.
> We have a for loop in which we need to modify the dataset and use the
> result in the next iteration. (The first iteration uses the original
> Dataset.)
> A variable of type Dataset<Row> is immutable, but inside the for loop we
> perform some processing (Spark ML Random Forest) on this variable and use
> the updated result in the next iteration. This continues until all
> iterations are complete. (The size of the dataset stays the same; only the
> values change.)
> Approach 1: We store the intermediate result in a new Dataset variable and
> use it in the next iteration. What we observed is that the first loop
> iteration takes only 1 second, while the remaining iterations take
> exponentially longer (the 2nd iteration takes 70 seconds, the 3rd 90
> seconds, and so on).
> Approach 2: We write the intermediate Dataset to HDFS/an external file and
> read it back fresh at each iteration. Each iteration then completes much
> faster than in the previous approach; however, writing and reading the
> data to/from HDFS/the external file takes extra time.
> This is the problem we need to fine-tune. Could anyone please suggest a
> better solution for this issue?
> Note: We are unpersisting the previous Datasets and assigning them NULL at
> the end of each loop iteration.
> Thanks in advance.
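[Editorial note] The exponential slowdown described in Approach 1 is characteristic of a query plan whose lineage grows with every iteration: each pass re-plans (and may re-execute) all previous passes. `Dataset.checkpoint()` (available since Spark 2.1) materializes the data and truncates the lineage, giving roughly the effect of Approach 2 without hand-rolled HDFS reads. A minimal, hedged sketch — the file path, column name, and per-iteration transformation are placeholders standing in for the reporter's Random Forest step:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class IterativeCheckpointSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("iterative-checkpoint-sketch")
        .getOrCreate();

    // Reliable checkpointing requires a checkpoint directory (HDFS in a
    // real cluster; a local path here for illustration only).
    spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints");

    // Placeholder input; the original issue reads from a flat file.
    Dataset<Row> ds = spark.read().option("header", "true")
        .csv("/path/to/flat/file.csv");

    for (int i = 0; i < 10; i++) {
      // Per-iteration processing goes here (the reporter runs Spark ML
      // Random Forest); a trivial column update stands in for it, and
      // "value" is a hypothetical column name.
      Dataset<Row> updated = ds.withColumn("value", col("value").multiply(2));

      // Without this, iteration i's plan embeds the plans of all previous
      // iterations, so planning/execution time grows each pass.
      // checkpoint() eagerly materializes the rows and cuts the lineage.
      // (On Spark 2.3+, localCheckpoint() does the same using executor
      // storage, avoiding the write to the checkpoint directory.)
      ds = updated.checkpoint();
    }

    spark.stop();
  }
}
```

This keeps the loop-carried variable's plan bounded in size, which is the usual fix for iteration time growing with the iteration count.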
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org