Hi all, I'm not sure whether this belongs here in users or over in dev; I guess it's somewhere in between. We have started implementing some machine learning pipelines, and it seemed from the documentation that Spark had a fairly well-thought-out platform (see: http://spark.apache.org/docs/1.6.1/ml-guide.html )
I liked the design of Transformers, Models, Estimators, Pipelines, etc. However, as soon as we began writing our first ones, we ran into one class or method after another that has been marked private. Some examples:

- SchemaUtils (for validating schemas passed in and out, and for adding output columns to DataFrames)
- Loader / Saveable (traits / helpers for saving and loading models)
- Several classes under the 'collection' namespace, such as OpenHashSet / OpenHashMap
- All of the underlying Breeze linear algebra details
- Other classes specific to particular models; we are writing an alternative LDA optimizer / trainer, and everything under LDAUtils is private

I'd like to ask what the expected approach is here. I see a few options, none of which seems appropriate:

1. Implement everything in the org.apache.spark.* namespaces to match the package-private declarations (see the sketch at the end of this mail).
   - Will this even work in our own modules?
   - We would be open to contributing some of our code back, but we're not sure the project wants it.
2. Implement our own versions of all of these things.
   - Lots of extra work for us, and it invites unseen gotchas and other unforeseen issues in our implementations.
3. Copy the classes into our own namespace for use.
   - Duplicates code, and our copies will diverge as the main code is kept up to date.

Thanks in advance for any recommendations on this frustrating issue.

Thunder
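P.S. For concreteness, here is roughly what we mean by option 1: a minimal sketch that, as far as we can tell, compiles against spark-mllib 1.6.1 only because the class is declared inside the org.apache.spark namespace. MyFeatureTransformer and the "input" / "output" column names are made up for illustration; SchemaUtils is Spark's private[spark] helper from org.apache.spark.ml.util.

// Sketch of option 1: our own Transformer living in Spark's namespace
// so that private[spark] members such as SchemaUtils resolve.
// MyFeatureTransformer and the column names are hypothetical.
package org.apache.spark.ml.feature

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DoubleType, StructType}

class MyFeatureTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("myFeature"))

  override def transformSchema(schema: StructType): StructType = {
    // These calls only compile because we are under org.apache.spark:
    // SchemaUtils is private[spark].
    SchemaUtils.checkColumnType(schema, "input", DoubleType)
    SchemaUtils.appendColumn(schema, "output", DoubleType)
  }

  override def transform(dataset: DataFrame): DataFrame = {
    transformSchema(dataset.schema)
    // Placeholder logic: double the input column.
    dataset.withColumn("output", dataset("input") * 2)
  }

  override def copy(extra: ParamMap): MyFeatureTransformer = defaultCopy(extra)
}

This does build in our own module, but it plainly leans on internals that could change under us in any release, which is exactly what makes us uneasy about this approach.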