Model metadata (mostly parameter values) is usually tiny. The Parquet data mostly holds the model coefficients, so the size depends on the size of your model, i.e. your feature dimension.
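For example, here is a minimal sketch (assuming the Spark 2.0 spark.ml API and a made-up output path) of fitting and saving a small logistic regression model; the saved directory then contains a metadata/ part (JSON params) and a data/ part (Parquet coefficients):

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("SaveModelExample").getOrCreate()
  import spark.implicits._

  // Tiny toy training set; the feature dimension here is 3.
  val training = Seq(
    (1.0, Vectors.dense(0.0, 1.1, 0.1)),
    (0.0, Vectors.dense(2.0, 1.0, -1.0)),
    (1.0, Vectors.dense(0.0, 1.2, -0.5))
  ).toDF("label", "features")

  val model = new LogisticRegression().setMaxIter(10).fit(training)

  // Writes a directory with two parts:
  //   metadata/ -- JSON: class name, uid, params (tiny)
  //   data/     -- Parquet: fitted coefficients/intercept (size ~ feature dimension)
  model.write.overwrite().save("/tmp/lr-model")

  spark.stop()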
A high-dimensional linear model can be quite large, but it is still typically easy to fit into main memory on a single node. A high-dimensional multi-layer perceptron with many layers could be quite a lot larger. An ALS model with millions of users and items could be huge (see the rough estimate after the quoted message below).

On Thu, 18 Aug 2016 at 18:00, Rich Tarro <richta...@gmail.com> wrote:
> The following Databricks blog on Model Persistence states "Internally, we
> save the model metadata and parameters as JSON and the data as Parquet."
>
> https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
>
> What data associated with a model or Pipeline is actually saved (in
> Parquet format)?
>
> What factors determine how large the saved model or pipeline will be?
>
> Thanks.
> Rich
>
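P.S. A back-of-the-envelope sketch of the ALS case (the numbers below are made up for illustration, not from this thread): ALS stores one rank-length factor vector per user and per item, so the factor data alone is roughly (numUsers + numItems) * rank values.

  // Rough size estimate with assumed numbers.
  // spark.ml ALS stores factors as single-precision floats (4 bytes each).
  val numUsers = 10000000L  // 10 million users
  val numItems = 1000000L   // 1 million items
  val rank     = 100        // factor vector length
  val approxBytes = (numUsers + numItems) * rank * 4L
  println(approxBytes / 1e9) // ~4.4 GB of factor data, before Parquet compression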