Hi all, I am new to Spark and I am trying to apply forecasting models to time-series data. As I understand it, a Spark DataFrame is a distributed collection of data: chunks of the data live on different nodes and can be processed independently and in parallel.
This is a problem for time-series data, where accurate prediction depends on row order. Instead of building one distributed DataFrame over the full dataset, I thought I could split it into train and test sets in such a way that neither set gets distributed across nodes, so each is processed in one go. If this approach is possible, how can I ensure that the data does not get distributed, and how should I go about it?