Yes, that would be a suitable option. We could just extend the standard Spark MLlib Transformer and add the required meta-data.

Just out of curiosity: is there a specific reason why the user of a standard Transformer cannot add arbitrary key-value pairs as additional meta-data? This would be handy not just for things like versioning, but also for storing evaluation metrics together with a trained pipeline (for people who aren't using something like MLflow yet).

Cheers,

Martin

On 2021-10-25 14:38, Sean Owen wrote:

You can write a custom Transformer or Estimator?

On Mon, Oct 25, 2021 at 7:37 AM Sonal Goyal <sonalgoy...@gmail.com> wrote:
Hi Martin,

Agreed, if you don't need the other features of MLflow then it is likely overkill.

Cheers,
Sonal
https://github.com/zinggAI/zingg

On Mon, Oct 25, 2021 at 4:06 PM <mar...@wunderlich.com> wrote:

Hi Sonal,

Thanks a lot for this suggestion. I presume it might indeed be possible to use MLflow for this purpose, but at present it seems a bit too much to introduce another framework only for storing arbitrary meta-data with trained ML pipelines. I was hoping there might be a way to do this natively in Spark ML. Otherwise, I'll just create a wrapper class for the trained models.
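A minimal sketch of what such a wrapper could look like, assuming the trained pipeline is saved to a local filesystem path and the meta-data is JSON-serializable. The names `save_with_metadata`, `load_metadata` and the file name `custom_metadata.json` are hypothetical and not part of Spark; the model itself is written by whatever callable is passed in.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict

# Hypothetical file name for the side-car meta-data file (not a Spark convention).
METADATA_FILE = "custom_metadata.json"


def save_with_metadata(
    target_dir: str,
    save_model: Callable[[str], None],
    metadata: Dict[str, Any],
) -> None:
    """Save a model via `save_model` and write `metadata` as JSON next to it."""
    root = Path(target_dir)
    root.mkdir(parents=True, exist_ok=True)
    # The model goes into a sub-directory so that the meta-data file
    # does not interfere with the files the model writer produces.
    save_model(str(root / "model"))
    (root / METADATA_FILE).write_text(
        json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8"
    )


def load_metadata(target_dir: str) -> Dict[str, Any]:
    """Read back the meta-data written by `save_with_metadata`."""
    text = (Path(target_dir) / METADATA_FILE).read_text(encoding="utf-8")
    return json.loads(text)
```

With a fitted `PipelineModel` called `model` (an assumed variable), the callable could be something like `lambda path: model.write().overwrite().save(path)`. On a cluster that saves to HDFS or object storage, the side-car file would need to be written through the same storage layer rather than through `pathlib`.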

Cheers,

Martin

On 2021-10-24 21:16, Sonal Goyal wrote:

Does MLflow help you? https://mlflow.org/

I don't know whether MLflow can save arbitrary key-value pairs and associate them with a model, but versioning, evaluation, etc. are supported.

Cheers,
Sonal
https://github.com/zinggAI/zingg

On Wed, Oct 20, 2021 at 12:59 PM <mar...@wunderlich.com> wrote:

Hello,

This is my first post to this list, so I hope I won't violate any (un)written rules.

I recently started working with SparkNLP for a larger project. SparkNLP in turn is based on Apache Spark's MLlib. One thing I found missing is the ability to store custom parameters in a Spark pipeline. It seems only certain pre-configured parameter values are allowed (e.g. "stages" for the Pipeline class).

IMHO, it would be handy to be able to store custom parameters, e.g. for model versions or other meta-data, so that these parameters are stored with a trained pipeline, for instance. This could also be used to include evaluation results, such as accuracy, with trained ML models.

(I also asked this on Stack Overflow, but haven't received a response yet: https://stackoverflow.com/questions/69627820/setting-custom-parameters-for-a-spark-mllib-pipeline)

What does the community think about this proposal? Has it perhaps been discussed before? Any thoughts?

Cheers,

Martin
