Hi All,

I am new to Spark and am trying to use forecasting models on time-series
data. As I understand it, Spark DataFrames are distributed collections of
data. I take this to mean that chunks of the data are not dependent on each
other and may be processed separately and in parallel, which would lose the
ordering between observations that a time series relies on.

To work around this for time-series data and get accurate predictions, I
thought that instead of building one DataFrame from the whole dataset, I
would divide it into train and test sets in such a way that neither set is
distributed across nodes and each one is processed in one go.
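
To make the split concrete, here is a minimal plain-Python sketch (no Spark
involved) of the time-ordered train/test split I have in mind. The function
name chronological_split, the 80/20 ratio and the sample rows are only
assumptions for illustration, not my real data.

```python
from typing import List, Tuple

# One observation: (timestamp as an ISO-8601 date string, observed value)
Row = Tuple[str, float]


def chronological_split(
    rows: List[Row], train_fraction: float = 0.8
) -> Tuple[List[Row], List[Row]]:
    """Sort rows by timestamp and cut once, so all train rows precede test rows."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # ISO-8601 date strings sort chronologically when compared as strings
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical sample data, deliberately out of order
rows = [
    ("2019-01-03", 12.0),
    ("2019-01-01", 10.0),
    ("2019-01-05", 15.0),
    ("2019-01-02", 11.0),
    ("2019-01-04", 13.0),
]

train, test = chronological_split(rows, train_fraction=0.8)
```

What I do not know is how to keep each of these two pieces together on one
node once they are Spark DataFrames.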

If this approach is possible, how can I ensure that the data is not
distributed, and how should I go about it?



